
Parameter Efficient Diverse Paraphrase Generation Using Sequence-Level Knowledge Distillation

Lasal Jayawardena School of Computing
Informatics Institute of Technology
Colombo, Sri Lanka
[email protected]
   Prasan Yapa School of Computing
Informatics Institute of Technology
Colombo, Sri Lanka
[email protected]
Abstract

Over the past year, the field of Natural Language Generation (NLG) has experienced an exponential surge, largely due to the introduction of Large Language Models (LLMs). These models have exhibited state-of-the-art performance across a wide range of Natural Language Processing and Generation tasks. However, their application in domain-specific tasks, such as paraphrasing, presents significant challenges. Their extensive number of parameters makes them difficult to operate on commercial hardware, and they require substantial time for inference, leading to high costs in a production setting. In this study, we tackle these obstacles by employing LLMs to develop three distinct models for the paraphrasing field, applying a method referred to as sequence-level knowledge distillation. These distilled models maintain the quality of paraphrases generated by the LLM while offering faster inference times and the ability to generate diverse paraphrases of comparable quality. A notable characteristic of these models is their ability to exhibit syntactic diversity while also preserving lexical diversity, a combination rarely observed in prior neural approaches, largely due to quality issues in existing datasets. Human evaluation of our models shows only a 4% drop in performance compared to the LLM teacher model used in the distillation process, despite the students being roughly 1000 times smaller. This research provides a significant contribution to the NLG field, offering a more efficient and cost-effective solution for paraphrasing tasks.

Index Terms:
paraphrase generation, natural language processing, knowledge distillation, deep learning, large language models

I Introduction

Paraphrase Generation, alternatively known as Question Paraphrase Generation, occupies a key role as a fundamental task within the field of Natural Language Processing (NLP). This area has been explored consistently for several decades, and its results have found wide application in data augmentation [1]. The influence of paraphrase generation on data augmentation is pivotal for optimizing the output of numerous NLP operations, leading to a substantial enrichment of training data [2]. Over the years, extensive research has gone into this branch of NLP, producing key strategies for optimizing the process [3].

Traditional methods of paraphrase generation, such as rule-based [4], thesaurus-based [5], and SMT-based methods [6], have been widely used in the past. However, they are inherently limited in their capacity to generate diverse and contextually accurate paraphrases; as a result, recent research has focused mainly on neural approaches.

The advent of Large Language Models (LLMs) has significantly transformed the landscape of NLP, including the domain of paraphrase generation. These models, underpinned by advanced neural networks such as Transformers, have shown unparalleled performance across a spectrum of NLP applications, covering everything from categorizing texts and assessing sentiments to translating languages and generating textual content [7]. LLMs undergo training on extensive text corpora and web resources, which equips them with the capability to decipher complex structures and patterns inherent in human language. Advanced techniques such as Reinforcement Learning with Human Feedback (RLHF) [8] have further improved their capabilities. This extensive training allows them to generate human-like writing that is not only syntactically accurate but also pertinent and logically consistent within the context. Moreover, LLMs can understand and generate text in a context-dependent manner, an essential element in the creation of paraphrases.

Despite the remarkable proficiency of LLMs, several challenges remain. One of the most significant issues is the high number of parameters in these models. The sheer size of LLMs makes them difficult to run on consumer hardware, limiting their accessibility for many users and applications. Additionally, the large number of parameters also leads to longer inference times, which can be a bottleneck in real-time applications. These challenges highlight the urgent need for domain-specific models that maintain the high performance of LLMs while being more efficient and accessible. In the case of paraphrase generation, there is a need for models that can generate high-quality paraphrases quickly and efficiently, without the need for high-end hardware.

In response to this need, we have leveraged the power of LLMs to distill and build models that are significantly smaller than the original teacher models. Our models are approximately a thousand times smaller in terms of the number of parameters, making them much more efficient and easier to run on consumer hardware. Despite their smaller size, these models maintain the high-quality paraphrase generation capabilities of their larger counterparts, a testament to the effectiveness of our approach. As described in Section IV, we have utilized a gold-standard evaluation strategy that is not commonly seen in most paraphrase generation research [9]. In addition, the results in Section IV-B1, obtained by employing human evaluators, illustrate that the distilled models were indeed capable of maintaining the quality and diversity of the output despite being drastically smaller than the LLM they were trained on.

The results of our research not only contribute to the field of paraphrase generation but also demonstrate the potential of knowledge distillation as a strategy for leveraging the power of LLMs in a more efficient and accessible manner. We believe that our work lays the groundwork for upcoming studies within the field of paraphrase generation, opening up new possibilities for the application of LLMs effectively.

II Related Work

II-A Paraphrase Datasets

To support the exploration and creation of models for generating paraphrases, a variety of datasets have been established. The Paraphrase Database (PPDB) [10] is a comprehensive resource that contains over 220 million paraphrase pairs. However, its utility has been questioned due to its exclusive focus on phrasal and lexical paraphrases, neglecting sentence paraphrases.

The Twitter URL dataset [11] is a large-scale collection of sentential paraphrases sourced from Twitter. However, due to the noisiness of the labels, this dataset is not widely used. The Wiki Answer dataset [12] contains an estimated 18 million pairs of paraphrased questions, but it is limited in scope as all the sentences provided are in the form of questions.

The MSCOCO dataset [13] was originally introduced as a comprehensive object detection dataset. It comprises over 120,000 images, each of which is accompanied by five distinct captions contributed by five separate annotators. The Microsoft Research Paraphrase Corpus (MRPC) Dataset [14] comprises 5800 sentence pairs derived from online news sources. It also includes human annotations that denote whether each pair represents a paraphrase or semantic equivalent.

The Quora Dataset or Quora Question Pair Dataset [15] contains 150,000 question pairs that are annotated as paraphrases. The ParaNMT dataset [16] comprises over 50 million pairs of English sentential paraphrases. These pairs were independently created by utilizing bidirectional translation for converting the non-English elements within a significant Czech-to-English equivalent dataset.

In the development of the ParaBank datasets, a Czech-English Neural Machine Translation (NMT) system was employed to create new variations of English reference sentences [17, 18]. The PAWS Dataset [19] contains sentences with high bag-of-words (BOW) overlap but having different word order.

Despite the variety of datasets available, there are still challenges in using them for paraphrase generation. The noise introduced in these datasets due to utilizing techniques such as back-translation can lead to error propagation, which cannot be mitigated even by improving the architecture of the model. Therefore, a data-centric approach is needed to handle these issues.

II-B Paraphrase Generation

Paraphrase generation has seen significant advancements with the introduction of sophisticated techniques. These include multi-round generation [20], which involves generating multiple iterations of paraphrases, and reinforcement learning-based paraphrasing [21], which employs reinforcement learning algorithms to optimize the paraphrasing process. Another noteworthy approach is prompt-tuning [22], which fine-tunes the prompts used in the generation process to yield better paraphrases. There is also a subset of research dedicated to enhancing the syntactic diversity of the generated paraphrases. This is achieved through various methods such as sampling from latent spaces [23], which generates diverse paraphrases by sampling different points in the latent space, and controlling the word order [24], which manipulates the arrangement of words to create diverse paraphrases. However, these methods often focus on one aspect of diversity, typically neglecting lexical diversity. Furthermore, the quality of data used in these approaches often leaves room for improvement.

Generative adversarial networks (GANs) [25] have also been utilized in paraphrase creation. The distinct characteristics of textual content generation pose an obstacle to the conventional training techniques applied in GANs. To overcome this, the concept of policy gradient is employed [26]. Word-level paraphrasing is another technique that focuses on generating paraphrases by substituting original words with synonyms. Some researchers have leveraged external linguistic knowledge to achieve this [27], while others have proposed unique mechanisms to learn synonym mappings [28]. These mechanisms can significantly enhance lexical diversity.

In the search for more sophisticated paraphrase generation, researchers have investigated techniques to manage the syntactic structure of paraphrased text, integrating various levels of detail. Studies in this field typically fall into two groups based on their approach to syntax control: implicit and explicit. In explicit control strategies, the sentence’s syntactic tree is transformed into vector forms, which are then incorporated into the decoder at each step of decoding [29]. Implicit control methods, alternatively, learn the distribution of syntactic information through a Variational Autoencoder (VAE); a syntax variable, drawn from the learned distribution, is integrated into the decoder at every step of decoding [30].

Building on these syntax-focused methods, multi-level paraphrasing techniques combine multiple granularity levels, enabling their models to generate synonyms, substitute phrases, and rearrange sentential structures [31]. These techniques aim to create a more comprehensive and nuanced approach to paraphrase generation.

However, a common challenge faced by most of these approaches stems from the scarcity of extensive corpora containing significant quality paraphrases. A high-quality paraphrase should be lexically diverse, syntactically diverse, grammatically correct, and semantically similar. Balancing all these aspects in paraphrase generation remains a significant challenge.

II-C Knowledge Distillation

Knowledge distillation is a technique used to train compact models, frequently termed students, by leveraging the “knowledge” of a more expansive model, designated as the teacher [32]. A common method of knowledge distillation is to train the student with an additional objective of aligning with the teacher’s representation, such as logits, output probabilities, or intermediate activations [33].

In the context of sequence-to-sequence or generative models, [34] introduced the idea of sequence-level knowledge distillation. This method involves the generation of a synthetic output by conducting inference with the teacher model, which is subsequently used to train the learner/student model. The efficiency of sequence-level distillation lies in the fact that it only necessitates running the typically large teacher model once. The efficacy of sequence-level distillation has been demonstrated in previous studies [35]. Recent research has adopted this technique, particularly with the use of LLMs [36]. Further extensions of this technique have been explored, such as the use of reverse Kullback-Leibler divergence (KLD) objectives to enhance the distillation process [37]. This indicates a growing interest in the field of NLP in utilizing sequence-level distillation to develop smaller, yet effective models, especially with the recent research involving LLMs.
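Concretely, sequence-level knowledge distillation reduces to training the student with ordinary cross-entropy on outputs decoded from the teacher. Using our own notation (x a source sentence, D the training source corpus, θ_S and θ_T the student and teacher parameters), the objective of [34] can be summarized as:

\mathcal{L}_{\mathrm{SEQ\text{-}KD}}(\theta_S) = -\sum_{x \in \mathcal{D}} \log p\big(\hat{y} \mid x;\, \theta_S\big), \qquad \hat{y} \approx \operatorname*{arg\,max}_{y}\; p\big(y \mid x;\, \theta_T\big),

where the arg max over output sequences is approximated in practice by the teacher’s decoding procedure (beam search, or in our setting the LLM’s generated paraphrases).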

III Methodology

Figure 1: High-Level Architecture Diagram for Training and Inference Phases.

We apply a data-centric sequence-level knowledge distillation technique in which the ChatGPT (gpt-3.5-turbo) LLM is used to create smaller models capable of producing diverse paraphrases of similar quality with a significantly smaller number of parameters. The detailed methodology is broken down and explained in the subsequent sections.

III-A Dataset Creation

To create the full dataset we used multiple data sources. We incorporated a subset of the Quora Dataset [15], specifically choosing sentence pairs labeled as paraphrases. Next, we utilized PAWSWiki, a segment within the PAWS (Paraphrase Adversaries from Word Scrambling) Dataset [19]. However, we deliberately avoided using PAWSQQP as it contained source sentences identical to those in the Quora Dataset. These two are the main corpora employed in the training process. For evaluation purposes, the Microsoft Research Paraphrase Corpus (MRPC) Dataset [14], the MSCOCO dataset [13], the Wiki Answer dataset [12], and the Twitter URL dataset [11] were incorporated.

For the paraphrase pair generation phase, we initially combined the aforementioned data sources, selecting approximately 750,000 source sentences for paraphrase generation. An initial pass was conducted to filter out offensive content using OpenAI’s Moderation Endpoint. This process flagged any source sentence that fell under the endpoint’s offensive-content categories, allowing us to filter out the source sentences containing such content. In addition to this, data sources created using back-translation often contain a significant amount of noise, including non-English source sentences. To tackle this, we developed a new prompt for the gpt-3.5-turbo model, instructing it to identify English sentences and generate paraphrases, or to output “Error” for non-English sentences. This method, coupled with a rule-based approach to filter out certain responses from the model, proved efficient in significantly reducing noise.
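The sketch below illustrates this filtering pass. It assumes the pre-1.0 openai Python client and placeholder file names; the batching, retries, and rate-limit handling needed at the scale of 750,000 sentences are omitted.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def is_flagged(sentence: str) -> bool:
    """Return True if OpenAI's Moderation Endpoint flags the sentence."""
    response = openai.Moderation.create(input=sentence)
    return response["results"][0]["flagged"]

# Placeholder file names; the real pipeline iterated over the combined corpora.
with open("source_sentences.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

clean_sentences = [s for s in sentences if not is_flagged(s)]

with open("filtered_sentences.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(clean_sentences))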

We then used the ChatGPT (gpt-3.5-turbo) LLM to augment the source sentences. The prompt used for each dataset varied slightly. For paraphrase generation, we adjusted the temperature parameter value to zero. The model output was a string containing augmented, diverse sentence paraphrases in a numbered list format. This string was subsequently processed to generate a list of strings, which was then used to construct a pool of paraphrases. By utilizing the paraphrase pool, we successfully generated nearly 2 million unique sentence paraphrase pairs. This was then used for model training and evaluation.
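A minimal sketch of this generation step is given below. The prompt wording and the number of paraphrases requested per sentence are illustrative only (the exact prompts varied per dataset, as noted above), and the pre-1.0 openai client is assumed; the numbered-list output is split into individual paraphrases that are added to the pool.

import re
import openai

PROMPT_TEMPLATE = (
    "If the following sentence is in English, generate five diverse paraphrases "
    'of it as a numbered list. If it is not in English, output "Error".\n\n'
    "Sentence: {sentence}"
)

def generate_paraphrase_pool(sentence: str) -> list[str]:
    """Return a pool of paraphrases for one source sentence (empty if rejected)."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,  # deterministic decoding, as used in our generation phase
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(sentence=sentence)}],
    )
    text = response["choices"][0]["message"]["content"]
    if text.strip().lower().startswith("error"):
        return []  # rule-based rejection of non-English or refused inputs
    # Split the numbered list ("1. ...", "2) ...") into individual paraphrases.
    items = re.split(r"\s*\d+[\.\)]\s*", text)
    return [item.strip() for item in items if item.strip()]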

III-B Model Training

The models selected for distillation were T5-small [38], Flan-T5-small [39], and BART-base [40], each chosen for their unique strengths and capabilities. These models are built on an encoder-decoder architecture, a design that is particularly effective for tasks involving conditional generation, such as paraphrase generation.

The T5-small model is known for its efficiency and high performance in text generation tasks. The model is crafted to adeptly deal with a wide range of NLP tasks, making it a versatile choice for our project. Its small size, compared to larger versions of T5, makes it more computationally efficient while still delivering strong performance. The T5-small model has only around 60 million parameters.

The Flan-T5-small model is an instruction-finetuned version of T5. It has been trained on a large collection of tasks spanning multiple languages, which could potentially enhance the quality of paraphrase generation by leveraging the syntactic and semantic similarities across languages. By including this model in our study, we aimed to explore the potential benefits of these multilingual and instruction-following capabilities in paraphrase generation. Flan-T5-small is slightly bigger than the T5-small model, with around 80 million parameters.

Lastly, the BART-base model was included because of its impressive capabilities in generating text and summarizing information. BART stands out in its pretraining methodology, being designed to restore the original content after specific tokens are obscured or complete sentences are rearranged. This makes BART particularly effective at tasks that require understanding the broader context of a sentence, such as paraphrase generation. This model is slightly larger, with around 140 million parameters, which is still roughly a thousand times smaller than ChatGPT.

By distilling these three models, we aimed to leverage their unique strengths and capabilities to create a robust and efficient model for paraphrase generation.

For the training process, we utilized the Quora Question Pairs and PAWS datasets. To simplify the learning process and let the models focus on the core task, we lowercased the data before training. The training method employed was Low-Rank Adaptation (LoRA) [41]. This method preserves the weights of the pre-trained model and integrates trainable rank decomposition matrices into every layer of the Transformer structure. Consequently, this strategy considerably lowers the number of parameters needed for subsequent tasks, resulting in reduced GPU memory usage and enhanced training throughput. The LoRA technique was applied through a library specifically designed for Parameter-Efficient Fine-Tuning (PEFT), developed by Hugging Face (https://github.com/huggingface/peft). The training of each model was conducted on an NVIDIA A100 40GB GPU, using approximately 1.4 million rows of data. This approach is very effective for knowledge distillation when working with such a large volume of data.

The hyperparameters for the three models were set as follows: the models were configured to accommodate a diverse range of sentence lengths by setting the maximum sequence length to 256. The models were trained for 10 epochs, providing a balance between training time and model performance. To ensure numerical stability, the Adam optimizer was utilized with an epsilon value of 1e-08. A learning rate of 0.0003 was used, along with a maximum gradient norm of 1.0, which serves to inhibit excessively large gradients. The LoRA configuration for these models was set with a rank (r) of 8 and an alpha value of 32, providing a balance between model complexity and performance. To avert the risk of overfitting, the dropout rate was set at 0.1. In all models, “paraphrase” was used as the prefix for the input sentences during training. The training times for the models were as follows: T5-small took 24 hours, Flan-T5-small took 30 hours, and BART-base took 50 hours to train for 10 epochs.
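A condensed sketch of this training setup is shown below, using the Hugging Face PEFT and Transformers libraries with the hyperparameters reported above. The toy dataset and the batch size are placeholders (the batch size is not reported here); in practice roughly 1.4 million lowercased, prefixed pairs were used.

from datasets import Dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "t5-small"  # likewise "google/flan-t5-small" or "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# LoRA configuration: rank 8, alpha 32, dropout 0.1, as reported above.
model = get_peft_model(model, LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM, r=8, lora_alpha=32, lora_dropout=0.1))

# Toy stand-in for the ~1.4M (source, paraphrase) training pairs.
pairs = Dataset.from_dict({
    "source": ["paraphrase: how do i learn python quickly?"],
    "target": ["what is the fastest way to pick up python?"],
})

def tokenize(batch):
    model_inputs = tokenizer(batch["source"], max_length=256, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_dataset = pairs.map(tokenize, batched=True, remove_columns=["source", "target"])

training_args = Seq2SeqTrainingArguments(
    output_dir="distilled-paraphraser",
    num_train_epochs=10,
    learning_rate=3e-4,
    adam_epsilon=1e-8,
    max_grad_norm=1.0,
    per_device_train_batch_size=32,  # assumption: the batch size is not reported
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()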

III-C Model Inference

Model inference is trickier than it may appear, since decoding hyperparameters need to be explored to obtain optimal output. The hyperparameters for inference were therefore set differently for each model to optimize its performance.

For BART, the maximum number of new tokens per generation was limited to 256, and early stopping was enabled to prevent the generation of unnecessary tokens. Sampling was enabled with 100 beams. The top-p value was set to 0.35 to control nucleus sampling, and the no-repeat n-gram size was set to 2 to prevent the model from generating repetitive phrases. The temperature was set to 2.5 to control the randomness of the output.

For the T5 model, the no-repeat n-gram size was likewise set to 2. Top-k sampling was used with k set to 50,000, the temperature was set to 0.7, and the top-p parameter for nucleus sampling was set to 0.75. Additionally, the model incorporated early stopping.

In the case of Flan-T5, early stopping was also enabled and the no-repeat n-gram size was set to 2. Beam search was conducted with 200 beams, and the temperature was set to 1.5.
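The decoding settings above translate into Hugging Face generate arguments roughly as sketched below. The helper function, the lowercasing, and the exact "paraphrase: " prefix form are our reading of the pipeline rather than verbatim code; model and tokenizer are a fine-tuned student model and its tokenizer.

import torch

def paraphrase(sentence: str, model, tokenizer, **gen_kwargs) -> list[str]:
    """Generate paraphrase candidates for one (lowercased, prefixed) sentence."""
    inputs = tokenizer("paraphrase: " + sentence.lower(), return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(**inputs, **gen_kwargs)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

# BART-base decoding settings.
bart_kwargs = dict(max_new_tokens=256, early_stopping=True, do_sample=True,
                   num_beams=100, top_p=0.35, no_repeat_ngram_size=2,
                   temperature=2.5)

# T5-small decoding settings.
t5_kwargs = dict(no_repeat_ngram_size=2, do_sample=True, top_k=50_000,
                 temperature=0.7, top_p=0.75, early_stopping=True)

# Flan-T5-small decoding settings.
flan_t5_kwargs = dict(early_stopping=True, no_repeat_ngram_size=2,
                      num_beams=200, temperature=1.5)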

Some post-processing was done to refine the output of the models. Since the output from the models was in lowercase, the first letter of each sentence was capitalized to ensure proper sentence structure. Next, we performed entity case correction. For this, we used the Named Entity Recognition (NER) model from the SpaCy library (https://github.com/explosion/spaCy). The NER model identifies various entities within the text, including the names of people, organizations, and locations. We restored the capitalization of these entities, ensuring that proper nouns were correctly capitalized in the output. This step is crucial as it enhances the readability and accuracy of the generated sentences. Finally, if multiple sequences were output, we removed any duplicates to ensure the uniqueness of the generated paraphrases. This step helps to maintain the diversity of the output, providing a wider range of paraphrases for each input sentence.
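A simplified version of this post-processing stage is sketched below. The spaCy pipeline name and the string-replacement approach to entity case restoration are assumptions; the actual implementation may differ.

import spacy

nlp = spacy.load("en_core_web_sm")  # assumption: the spaCy model used is not reported

def restore_entity_case(source: str, paraphrase: str) -> str:
    """Re-capitalize entity mentions in the paraphrase using the source text."""
    fixed = paraphrase
    for ent in nlp(source).ents:
        fixed = fixed.replace(ent.text.lower(), ent.text)
    return fixed

def post_process(source: str, candidates: list[str]) -> list[str]:
    seen, results = set(), []
    for cand in candidates:
        cand = cand[:1].upper() + cand[1:]        # capitalize the first letter
        cand = restore_entity_case(source, cand)  # entity case correction
        if cand not in seen:                      # drop duplicate sequences
            seen.add(cand)
            results.append(cand)
    return results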

IV Evaluation

TABLE I: Comparison of Semantic Similarity of the original dataset, distilled models and teacher model.
Model ADA Score (↑) SimCSE Score (↑) PromCSE Score (↑) Roberta Score (↑) Mpnet Score (↑)
Original Dataset 90.88% 71.90% 98.68% 72.10% 69.79%
ChatGPT 95.60% 91.25% 99.41% 88.23% 87.03%
T5 Small (Ours) 97.28% 94.59% 99.67% 92.77% 92.60%
Flan T5 Small (Ours) 97.75% 95.42% 99.71% 93.71% 93.69%
BART Base (Ours) 98.07% 95.77% 99.72% 94.04% 93.77%

The evaluation was done using both quantitative and qualitative evaluation techniques. For the quantitative analysis, we utilized 100,000 paraphrase pairs extracted from the MRPC dataset, the MSCOCO evaluation subset, the Twitter URL dataset, and the Wiki Answer dataset. For the qualitative analysis, we incorporated human evaluations and a novel evaluation technique known as LLM evaluations. The subsequent section will outline our findings in the evaluation process.

IV-A Quantitative Analysis

For the quantitative analysis, we will be evaluating whether the paraphrases are high quality and diverse. We will assess three main characteristics: semantic similarity, syntactic diversity, and lexical diversity.

IV-A1 Semantic Similarity

In our research, we evaluate semantic similarity by utilizing a range of models to produce sentence embeddings for both the original text and its paraphrased form. We then measure the cosine similarity between these embeddings; the scores reported in Table I are the cosine similarities between the embeddings of the source and paraphrased text. A minimal sketch of this computation is given after the metric list below.

  • The “Ada Score” is computed using OpenAI’s text-embedding-ada-002 model [42].

  • The “SimCSE Score” is derived from the sup-simcse-roberta-large model by SimCSE [43].

  • The “PromCSE Score” is based on the sup-promcse-roberta-large model from PromCSE [44].

We also make use of several models from the sentence-transformers library [45]:

  • The “Mpnet Score” is derived by employing the model known as all-mpnet-base-v1.

  • The “Roberta Score” is determined through the utilization of the all-roberta-large-v1 model.
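As an illustration, the sketch below computes the “Mpnet Score” for a single pair with the sentence-transformers library; the other embedding-based scores follow the same pattern with their respective models.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v1")

def mpnet_score(source: str, paraphrase: str) -> float:
    """Cosine similarity between the embeddings of the source and the paraphrase."""
    embeddings = model.encode([source, paraphrase], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

print(mpnet_score("How do I learn Python quickly?",
                  "What is the fastest way to pick up Python?"))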

The data in Table I show that, despite the distillation process, the models have retained semantic similarity comparable to the teacher model. The differences in similarity are negligible and are likely caused by different lexical choices and sentence structures. The comparatively low scores of the original dataset indicate that these corpora contain noise, despite having been used to train models in other research.

IV-A2 Syntactic Diversity

Syntactic diversity is a measure of the range and intricacy of sentence structures in a paraphrase relative to the original sentence. A high level of syntactic diversity suggests that the paraphrased sentences are varied and linguistically sophisticated, which is a characteristic of a high-quality paraphrase. We evaluate this diversity using metrics based on sentence syntax trees; a minimal sketch of the subtree kernel computation follows the metric list below.

  • The “Ted-F” encompasses the complete Tree Edit Distance measurement. This is achieved through constructing the constituency parse trees for the original and paraphrased sentences using Stanza [46], transforming the trees to bracket notation with the NLTK library [47] and regex, and then employing the APTED library [48] to compute the entire Tree Edit Distance.

  • The “Ted-3” represents Tree Edit Distance of the first three layers. It is calculated in a similar manner to “Ted-F”, but rather than calculating the comprehensive Tree Edit Distance, it focuses on the Tree Edit Distance for the first three strata.

  • The “Kermit Score” is computed by finding the cosine similarity between the original and paraphrase syntactic vectors using the Kermit library [49], and then subtracting this similarity from one. The syntactic embeddings are obtained by feeding the syntax trees of the two sentences to the Kermit encoder.

  • The “Subtree K Score” is the Subtree Kernel diversity. It is calculated by initially constructing constituency parse trees for the original and paraphrased sentences using Stanza, transforming the trees to an NLTK Tree, and then identifying all the subtrees. The kernel similarity is calculated by the ratio of unique common subtrees to the total count of unique subtrees. This figure is then subtracted from one to yield the diversity score.

  • The “Node Pair K Score” is the Subtree Node Pair Kernel diversity. It is calculated similarly to the “Subtree K Score”. The only difference is instead of subtrees this uses node pairs for the calculation.
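To make the subtree-based metrics concrete, the sketch below computes the “Subtree K Score” for one pair, starting from bracketed constituency parses such as those produced by Stanza. Representing each subtree by its flattened bracketed string is our simplification of the kernel.

from nltk import Tree

def subtree_set(bracketed_parse: str) -> set:
    """All subtrees of a constituency parse, keyed by their bracketed form."""
    tree = Tree.fromstring(bracketed_parse)
    return {subtree.pformat(margin=10**6) for subtree in tree.subtrees()}

def subtree_kernel_diversity(parse_src: str, parse_para: str) -> float:
    a, b = subtree_set(parse_src), subtree_set(parse_para)
    similarity = len(a & b) / len(a | b)  # unique common subtrees / all unique subtrees
    return 1.0 - similarity               # higher means more syntactic divergence

src = "(ROOT (S (NP (PRP I)) (VP (VBP like) (NP (NN tea))) (. .)))"
para = "(ROOT (S (NP (NN Tea)) (VP (VBZ is) (ADJP (JJ something) (SBAR (S (NP (PRP I)) (VP (VBP like)))))) (. .)))"
print(subtree_kernel_diversity(src, para))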

Table II shows the full syntactic diversity results. The data suggest that the distilled models have successfully retained the ability to generate syntactically diverse paraphrases, similar to the teacher. This characteristic is rarely seen in neural-based approaches, which makes it a significant improvement.

TABLE II: Comparison of Syntactic Diversity of the original dataset, distilled models and teacher model.
Model Ted-F (↑) Ted-3 (↑) Kermit Score (↑) Subtree K Score (↑) Node Pair K Score (↑)
Original Dataset 17.38 4.02 74.81% 95.36% 84.14%
ChatGPT 21.24 4.53 66.94% 92.77% 79.20%
T5 Small (Ours) 17.29 3.99 54.89% 83.36% 66.58%
Flan T5 Small (Ours) 18.38 4.40 54.96% 83.23% 65.45%
BART Base (Ours) 23.45 5.06 61.98% 88.97% 72.30%
TABLE III: Comparison of lexical diversity of the original dataset, distilled models and teacher model.
Model BOW Overlap Score (↑) Corpus BLEU Score (↑) Corpus BLEU2 Score (↑) METEOR Score (↑) ROUGE 1 Score (↑) ROUGE 2 Score (↑)
Original Dataset 58.59% 99.40% 85.12% 59.28% 53.64% 77.02%
ChatGPT 50.55% 99.54% 84.01% 43.20% 42.56% 70.15%
T5 Small (Ours) 35.42% 99.46% 65.79% 29.50% 28.19% 50.78%
Flan T5 Small (Ours) 34.06% 99.48% 63.93% 27.80% 26.23% 47.05%
BART Base (Ours) 39.49% 99.50% 75.71% 36.27% 33.51% 59.79%
Model ROUGE L Score (↑) Token ∩/∪ Score (↑) TER Score (↑) WER Score (↑) CharacTER Score (↑) Google BLEU Score (↑)
Original Dataset 58.49% 69.73% 80.45 85.44 69.27 80.88%
ChatGPT 54.17% 62.93% 63.05 77.44 77.90 77.89%
T5 Small (Ours) 42.17% 43.52% 49.05 66.83 54.80 59.55%
Flan T5 Small (Ours) 41.98% 40.94% 48.77 69.97 56.02 57.91%
BART Base (Ours) 52.11% 49.68% 59.23 82.47 68.00 68.15%

IV-A3 Lexical Diversity

Lexical diversity signifies the range and variety of words used in a text. It serves as an indicator of the breadth of vocabulary and the use of synonyms. In the context of paraphrasing, gauging lexical diversity is essential to understand how much the vocabulary varies. We utilized an array of metrics to gauge lexical diversity; a minimal sketch of the token-overlap based scores follows the metric list below.

  • The “BOW Overlap Score” is calculated by identifying the shared tokens between the original and the paraphrased text and dividing by the total count of tokens. This value is then subtracted from one.

  • The “Corpus BLEU Score” is evaluated using the SacreBLEU library [50]. This score is then subtracted from one.

  • The “Corpus BLEU2 Score” is constructed using the SacreBLEU library with the “method1” smoothing function. This score is then subtracted from one.

  • The “Sentence BLEU Score” is calculated similarly to the Corpus BLEU score using the SacreBLEU library but at the sentence level. This score is also subtracted from one.

  • The “METEOR Score” is evaluated using the NLTK library. This score is then subtracted from one.

  • The “ROUGE 1 Score” is constructed using the Google Research library (https://github.com/google-research/google-research). This score is then subtracted from one.

  • The “ROUGE 2 Score” is evaluated using the Google Research library. This score is then subtracted from one.

  • The “ROUGE L Score” is calculated using the Google Research library. This score is then subtracted from one.

  • The “Token ∩/∪ Score” is similar to the BOW Overlap Score but with a minor variation: it is calculated using the shared tokens between the original and the paraphrased text, divided by the total number of unique tokens. This value is then subtracted from one.

  • The “Google BLEU Score” is evaluated using Huggingface’s Evaluate library (https://github.com/huggingface/evaluate). This score is then subtracted from one.

  • The “TER Score” is the Translation Error Rate score, constructed using Huggingface’s Evaluate library.

  • The “WER Score” is the Word Error Rate score, evaluated using Huggingface’s Evaluate library.

  • The “CharacTER Score” is the character-level Translation Error Rate score, calculated using Huggingface’s Evaluate library.
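For concreteness, the sketch below implements the two purely token-based scores under the assumption of whitespace tokenization (the exact tokenizer is not specified above); the remaining scores are thin wrappers around SacreBLEU, NLTK, the Google Research ROUGE implementation, and Huggingface’s Evaluate library.

def bow_overlap_score(source: str, paraphrase: str) -> float:
    """1 minus (shared tokens / total paraphrase tokens); one plausible reading."""
    src_tokens = source.lower().split()
    para_tokens = paraphrase.lower().split()
    shared = sum(1 for tok in para_tokens if tok in src_tokens)
    return 1.0 - shared / len(para_tokens)

def token_iou_score(source: str, paraphrase: str) -> float:
    """1 minus (shared unique tokens / unique tokens in the union)."""
    src, para = set(source.lower().split()), set(paraphrase.lower().split())
    return 1.0 - len(src & para) / len(src | para)

print(bow_overlap_score("How do I learn Python quickly?",
                        "What is the fastest way to pick up Python?"))
print(token_iou_score("How do I learn Python quickly?",
                      "What is the fastest way to pick up Python?"))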

Results in Table III show that the distilled models were able to generate lexically diverse paraphrases. Although the teacher model scores higher, the distilled models still exhibit substantial lexical diversity, close to that of the teacher.

TABLE IV: Human evaluation Results.
Model Semantic Similarity (↑) Lexical Diversity (↑) Syntactic Diversity (↑) Grammatical Correctness (↑)
Original Dataset 3.47 3.08 2.93 4.53
ChatGPT 4.33 3.43 3.26 4.97
T5 Small (Ours) 4.21 3.23 2.95 4.83
Flan T5 Small (Ours) 4.05 3.19 2.83 4.74
BART Base (Ours) 4.20 3.25 3.16 4.89
TABLE V: LLM evaluation results.
Model Semantic Similarity (↑) Lexical Diversity (↑) Syntactic Diversity (↑) Grammatical Correctness (↑)
Original Dataset 3.40 2.83 2.88 4.37
ChatGPT 4.82 3.01 3.85 4.89
T5 Small (Ours) 4.51 2.48 3.20 4.68
Flan T5 Small (Ours) 4.26 2.20 2.85 4.41
BART Base (Ours) 4.75 2.63 3.41 4.69

IV-B Qualitative Analysis

In order to gain in-depth insight into the performance of our models, we conducted a qualitative analysis using two distinct methods: human evaluation and Large Language Model (LLM) evaluation.

IV-B1 Human Evaluation

We recruited a group of five independent evaluators who are proficient in English for the human evaluation phase. The evaluation process involved the selection of one thousand paraphrase pairs from four different data sources: the MRPC dataset, the MSCOCO dataset evaluation subset, the Twitter URL dataset, and the Wiki Answer dataset. From each of these data sources, 250 paraphrase pairs were selected, ensuring a balanced representation. The selection process was carefully conducted to ensure that the sentence lengths from each data source were accurately represented. For each of the one thousand paraphrase pairs, we obtained model outputs from four different models: ChatGPT and the three trained models. This resulted in a total annotation set of 5000 pairs for evaluation, which includes both the original dataset paraphrase pairs and the model outputs. This extensive set of evaluations facilitates a comprehensive and reliable assessment of each model’s performance.

In our assessment, we employed a 5-point Likert scale [51] for the evaluation of Semantic Similarity, Lexical Diversity, Syntactic Diversity, and Grammatical Correctness. The breakdown of the Likert scale is as follows:

  • For Semantic Similarity, a score of 5 implies that the text’s meaning aligns perfectly or almost perfectly with the source text. This could suggest that the text expresses the same ideas, reaches the same conclusions, or presents the same arguments. Conversely, a score of 1 in semantic similarity would suggest that the text’s meaning is entirely different or unrelated to the source text, indicating differing ideas, conclusions, or arguments.

  • For Lexical Diversity, a score of 5 signifies a broad and rich vocabulary. This could suggest that the text employs a diverse array of words, synonyms, and phrases, and avoids repetition. In contrast, a score of 1 in lexical diversity would suggest a limited vocabulary range, indicating that the text uses a small set of words, heavily relies on a few key phrases, or frequently repeats the same words or phrases.

  • For Syntactic Diversity, a score of 5 denotes a high degree of variation in sentence structure. This could suggest that the text employs a variety of sentence types, lengths, and structures, and avoids repetition. Conversely, a score of 1 in syntactic diversity would suggest minimal variation in sentence structure, indicating that the text uses a limited number of sentence types, heavily relies on a few key structures, or frequently repeats the same sentence structures.

  • For Grammatical Correctness, a score of 5 signifies perfect grammar. This could suggest that the text employs correct punctuation, spelling, and syntax, and avoids grammatical errors. Conversely, a score of 1 in grammatical correctness would suggest significant errors that affect comprehension, indicating that the text contains frequent spelling, punctuation, or syntax errors, or that these errors hinder its comprehensibility.

The evaluation instructions given to the human evaluators can be seen in Fig. A2. Grammatical correctness is one aspect that is not normally evaluated in other research work, but it is crucial for identifying the effectiveness of a paraphrase. The final data obtained from the human evaluation are given in Table IV.

IV-B2 LLM Evaluation

Over the past year, the application of Large Language Models (LLMs) as evaluators in NLP has seen a notable rise. This surge in popularity is primarily attributed to the superior performance of LLMs over existing reference-free metrics [52]. Recognizing this potential, our research also adopted an LLM-based evaluation strategy, utilizing OpenAI’s gpt-4 model, which was the state-of-the-art (SOTA) LLM at the time of conducting this study [53].

The data used for this evaluation was identical to that provided to our human annotators. We crafted the prompt in alignment with the instructions given to the human annotators, as depicted in Fig. A1. This approach ensured an evaluation strategy consistent with that of the human evaluation.
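A sketch of one evaluation call is given below. The prompt template is abbreviated (the full text appears in Fig. A1), the pre-1.0 openai client is assumed, and no retry logic is shown for responses that are not valid JSON.

import json
from string import Template

import openai

EVAL_PROMPT = Template(
    "Source Text: $source_text Paraphrase: $paraphrase "
    "Please evaluate the following aspects of the paraphrase ... "  # full prompt in Fig. A1
    "Please provide your ratings for each aspect using the following json format: "
    '{"Semantic Similarity": ..., "Lexical Diversity": ..., '
    '"Syntactic Diversity": ..., "Grammatical Correctness": ...}'
)

def llm_evaluate(source_text: str, paraphrase: str) -> dict:
    """Return the four Likert ratings produced by gpt-4 for one pair."""
    prompt = EVAL_PROMPT.substitute(source_text=source_text, paraphrase=paraphrase)
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,  # assumption: decoding settings for evaluation are not reported
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response["choices"][0]["message"]["content"])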

This innovative methodology could serve as a valuable reference for future studies in this field. It not only provides a new perspective on the use of LLMs in NLP but also opens up possibilities for further exploration and development of more advanced and efficient evaluation techniques. The final results of the LLM evaluation are given in Table V.

V Discussion

Our evaluation process, which combined both quantitative and qualitative methods, provided a comprehensive understanding of the performance of our models. The quantitative analysis focused on three main characteristics: semantic similarity, syntactic diversity, and lexical diversity. The results indicated that, despite the distillation process, our models were able to maintain semantic similarity, syntactic diversity, and lexical diversity in comparison to the teacher model (ChatGPT).

Our comprehensive qualitative evaluation, which incorporated human evaluation, provided an accurate and in-depth understanding of our research outcomes. The results distinctly demonstrated that, despite our models being a thousand times smaller than ChatGPT, they were able to maintain comparable performance levels. This is a significant achievement, highlighting the efficiency and effectiveness of our models in generating high-quality paraphrases. Moreover, the use of LLM evaluation in our study introduced a novel approach to performance assessment in our research domain. This innovative strategy, which leverages the capabilities of SOTA language models for evaluation, offers a promising avenue for future research. With further refinement and development, this LLM evaluation approach has the potential to serve as a robust benchmarking tool for assessing the performance of paraphrasing models and other natural language processing tasks.

In our evaluation, we employed a wide array of metrics to assess various aspects of the generated paraphrases. This approach was adopted in recognition of the fact that a single metric may not fully capture the effectiveness of a paraphrase, as it tends to focus on a specific characteristic. This highlights an active area of research that warrants further exploration. The use of inappropriate or insufficient metrics can lead to a skewed understanding of the models’ performance. Therefore, future research should emphasize the development of a unified metric that can holistically evaluate the quality of paraphrases. Such a metric would not only provide a more precise assessment of the model performance but also contribute to the advancement of the field of paraphrase generation.

Even though the distilled models were able to retain the quality of the distilled knowledge, one area where ChatGPT remains superior is its ability to generate paraphrases that are diverse from one another; that is, there was notable variation among the paraphrases ChatGPT generated for the same input. Our models, however, did not guarantee the same level of diversity, indicating a need for further research to optimize the inference hyperparameters and enhance the diversity of the generated paraphrases. Random sampling could be a potential starting point for this, but more extensive work is required to fully address the issue.

Since our models were trained on ChatGPT outputs, they may inherit potential risks associated with it, including the propagation of biases present in the training data. Therefore, it is crucial for future research to address these issues and develop strategies to mitigate the potential risks associated with using LLMs for knowledge distillation in paraphrase generation.

VI Conclusion

Our research offers a more efficient and cost-effective solution for paraphrase generation, making it more accessible for various applications. The distilled models provide a viable alternative to using large LLMs, opening up new possibilities for their application in real-world scenarios. A notable result of our models is the ability to produce both lexically and syntactically diverse paraphrases. Our models were at least a thousand times smaller than the original LLM and were still able to perform on par with it.

Overall, our work sets the foundation for future advancements in parameter-efficient and diverse paraphrase generation. It underscores the opportunities for additional exploration and innovation in this domain, paving the way for more efficient and accessible solutions in the field of NLG and paraphrase generation.

VII Acknowledgment

We wish to extend our appreciation and thanks to the research participants who took part in the human evaluation of the models. Their insights and feedback have been instrumental in assessing the quality and effectiveness of our research.

References

  • [1] K. R. McKeown, “Paraphrasing using given and new information in a question-answer system,” in Proceedings of the 17th annual meeting on Association for Computational Linguistics -.   La Jolla, California: Association for Computational Linguistics, 1979, p. 67. [Online]. Available: http://portal.acm.org/citation.cfm?doid=982163.982182
  • [2] M. Meteer and V. Shaked, “Strategies for effective paraphrasing,” in Proceedings of the 12th conference on Computational linguistics -, vol. 2.   Budapest, Hungary: Association for Computational Linguistics, 1988, pp. 431–436. [Online]. Available: http://portal.acm.org/citation.cfm?doid=991719.991724
  • [3] R. Kozlowski, K. F. McCoy, and K. Vijay-Shanker, “Generation of single-sentence paraphrases from predicate/argument structure using lexico-grammatical resources,” in Proceedings of the second international workshop on Paraphrasing -, vol. 16.   Sapporo, Japan: Association for Computational Linguistics, 2003, pp. 1–8. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1118984.1118985
  • [4] D. Lin and P. Pantel, “Discovery of inference rules for question-answering,” Natural Language Engineering, vol. 7, no. 4, pp. 343–360, Dec. 2001. [Online]. Available: https://www.cambridge.org/core/product/identifier/S1351324901002765/type/journal_article
  • [5] D. Kauchak and R. Barzilay, “Paraphrasing for Automatic Evaluation,” in Proceedings of the Human Language Technology Conference of the NAACL, Main Conference.   New York City, USA: Association for Computational Linguistics, Jun. 2006, pp. 455–462. [Online]. Available: https://aclanthology.org/N06-1058
  • [6] S. Wubben, A. van den Bosch, and E. Krahmer, “Paraphrase Generation as Monolingual Translation: Data and Evaluation,” in Proceedings of the 6th International Natural Language Generation Conference.   Association for Computational Linguistics, Jul. 2010. [Online]. Available: https://aclanthology.org/W10-4223
  • [7] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J.-Y. Nie, and J.-R. Wen, “A Survey of Large Language Models,” Sep. 2023, arXiv:2303.18223 [cs] version: 12. [Online]. Available: http://arxiv.org/abs/2303.18223
  • [8] Z. Li, Z. Yang, and M. Wang, “Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism,” 2023. [Online]. Available: https://arxiv.org/abs/2305.18438
  • [9] J. Zhou and S. Bhat, “Paraphrase Generation: A Survey of the State of the Art,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, Eds.   Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 5075–5086. [Online]. Available: https://aclanthology.org/2021.emnlp-main.414
  • [10] J. Ganitkevitch, B. Van Durme, and C. Callison-Burch, “PPDB: The Paraphrase Database,” in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.   Atlanta, Georgia: Association for Computational Linguistics, Jun. 2013, pp. 758–764. [Online]. Available: https://aclanthology.org/N13-1092
  • [11] W. Lan, S. Qiu, H. He, and W. Xu, “A Continuously Growing Dataset of Sentential Paraphrases,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.   Copenhagen, Denmark: Association for Computational Linguistics, 2017, pp. 1224–1234. [Online]. Available: http://aclweb.org/anthology/D17-1126
  • [12] A. Fader, L. Zettlemoyer, and O. Etzioni, “Paraphrase-Driven Learning for Open Question Answering,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).   Sofia, Bulgaria: Association for Computational Linguistics, Aug. 2013, pp. 1608–1618. [Online]. Available: https://aclanthology.org/P13-1158
  • [13] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common Objects in Context,” in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds.   Cham: Springer International Publishing, 2014, vol. 8693, pp. 740–755, series Title: Lecture Notes in Computer Science. [Online]. Available: http://link.springer.com/10.1007/978-3-319-10602-1_48
  • [14] W. B. Dolan and C. Brockett, “Automatically Constructing a Corpus of Sentential Paraphrases,” in Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005. [Online]. Available: https://aclanthology.org/I05-5002
  • [15] S. Iyer, N. Dandekar, and K. Csernai, “First Quora Dataset Release: Question Pairs,” 2017. [Online]. Available: https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs
  • [16] J. Wieting and K. Gimpel, “ParaNMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).   Melbourne, Australia: Association for Computational Linguistics, 2018, pp. 451–462. [Online]. Available: http://aclweb.org/anthology/P18-1042
  • [17] J. E. Hu, R. Rudinger, M. Post, and B. Van Durme, “ParaBank: Monolingual Bitext Generation and Sentential Paraphrasing via Lexically-constrained Neural Machine Translation,” Jan. 2019, arXiv:1901.03644 [cs]. [Online]. Available: http://arxiv.org/abs/1901.03644
  • [18] J. E. Hu, A. Singh, N. Holzenberger, M. Post, and B. Van Durme, “Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering,” in Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL).   Hong Kong, China: Association for Computational Linguistics, 2019, pp. 44–54. [Online]. Available: https://www.aclweb.org/anthology/K19-1005
  • [19] Y. Zhang, J. Baldridge, and L. He, “PAWS: Paraphrase Adversaries from Word Scrambling,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).   Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 1298–1308. [Online]. Available: https://aclanthology.org/N19-1131
  • [20] Z. Lin and X. Wan, “Pushing Paraphrase Away from Original Sentence: A Multi-Round Paraphrase Generation Approach,” 2021. [Online]. Available: https://arxiv.org/abs/2109.01862
  • [21] M. Liu, E. Yang, D. Xiong, Y. Zhang, Y. Meng, C. Hu, J. Xu, and Y. Chen, “A Learning-Exploring Method to Generate Diverse Paraphrases with Multi-Objective Deep Reinforcement Learning,” in Proceedings of the 28th International Conference on Computational Linguistics, D. Scott, N. Bel, and C. Zong, Eds.   Barcelona, Spain (Online): International Committee on Computational Linguistics, Dec. 2020, pp. 2310–2321. [Online]. Available: https://aclanthology.org/2020.coling-main.209
  • [22] J. R. Chowdhury, Y. Zhuang, and S. Wang, “Novelty Controlled Paraphrase Generation with Retrieval Augmented Conditional Prompt Tuning,” 2022. [Online]. Available: https://arxiv.org/abs/2202.00535
  • [23] Y. Cao and X. Wan, “DivGAN: Towards Diverse Paraphrase Generation via Diversified Generative Adversarial Network,” in Findings of the Association for Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu, Eds.   Online: Association for Computational Linguistics, Nov. 2020, pp. 2411–2421. [Online]. Available: https://aclanthology.org/2020.findings-emnlp.218
  • [24] T. Goyal and G. Durrett, “Neural Syntactic Preordering for Controlled Paraphrase Generation,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, Eds.   Online: Association for Computational Linguistics, Jul. 2020, pp. 238–252. [Online]. Available: https://aclanthology.org/2020.acl-main.22
  • [25] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Networks,” 2014. [Online]. Available: https://arxiv.org/abs/1406.2661
  • [26] L. Yu, W. Zhang, J. Wang, and Y. Yu, “SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient,” Aug. 2017, arXiv:1609.05473 [cs]. [Online]. Available: http://arxiv.org/abs/1609.05473
  • [27] Z. Lin, Z. Li, N. Ding, H.-T. Zheng, Y. Shen, W. Wang, and C.-Z. Zhao, “Integrating Linguistic Knowledge to Sentence Paraphrase Generation,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, pp. 8368–8375, Apr. 2020. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/6354
  • [28] Y. Fu, Y. Feng, and J. P. Cunningham, “Paraphrase Generation with Latent Bag of Words,” Jan. 2020, arXiv:2001.01941 [cs]. [Online]. Available: http://arxiv.org/abs/2001.01941
  • [29] A. Kumar, K. Ahuja, R. Vadapalli, and P. Talukdar, “Syntax-Guided Controlled Generation of Paraphrases,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 329–345, 2020. [Online]. Available: https://aclanthology.org/2020.tacl-1.22
  • [30] W. Chen, J. Tian, L. Xiao, H. He, and Y. Jin, “A Semantically Consistent and Syntactically Variational Encoder-Decoder Framework for Paraphrase Generation,” in Proceedings of the 28th International Conference on Computational Linguistics, D. Scott, N. Bel, and C. Zong, Eds.   Barcelona, Spain (Online): International Committee on Computational Linguistics, Dec. 2020, pp. 1186–1198. [Online]. Available: https://aclanthology.org/2020.coling-main.102
  • [31] A. Kazemnejad, M. Salehi, and M. Soleymani Baghshah, “Paraphrase Generation by Learning How to Edit from Samples,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, Eds.   Online: Association for Computational Linguistics, Jul. 2020, pp. 6010–6021. [Online]. Available: https://aclanthology.org/2020.acl-main.535
  • [32] G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” 2015. [Online]. Available: https://arxiv.org/abs/1503.02531
  • [33] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu, “TinyBERT: Distilling BERT for Natural Language Understanding,” in Findings of the Association for Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu, Eds.   Online: Association for Computational Linguistics, Nov. 2020, pp. 4163–4174. [Online]. Available: https://aclanthology.org/2020.findings-emnlp.372
  • [34] Y. Kim and A. M. Rush, “Sequence-Level Knowledge Distillation,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, J. Su, K. Duh, and X. Carreras, Eds.   Austin, Texas: Association for Computational Linguistics, Nov. 2016, pp. 1317–1327. [Online]. Available: https://aclanthology.org/D16-1139
  • [35] N. Bogoychev, R. Grundkiewicz, A. F. Aji, M. Behnke, K. Heafield, S. Kashyap, E.-I. Farsarakis, and M. Chudyk, “Edinburgh’s Submissions to the 2020 Machine Translation Efficiency Task,” in Proceedings of the Fourth Workshop on Neural Generation and Translation, A. Birch, A. Finch, H. Hayashi, K. Heafield, M. Junczys-Dowmunt, I. Konstas, X. Li, G. Neubig, and Y. Oda, Eds.   Online: Association for Computational Linguistics, Jul. 2020, pp. 218–224. [Online]. Available: https://aclanthology.org/2020.ngt-1.26
  • [36] M. Wu, A. Waheed, C. Zhang, M. Abdul-Mageed, and A. F. Aji, “LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions,” May 2023, arXiv:2304.14402 [cs]. [Online]. Available: http://arxiv.org/abs/2304.14402
  • [37] Y. Gu, L. Dong, F. Wei, and M. Huang, “Knowledge Distillation of Large Language Models,” Jun. 2023, arXiv:2306.08543 [cs]. [Online]. Available: http://arxiv.org/abs/2306.08543
  • [38] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,” 2019. [Online]. Available: https://arxiv.org/abs/1910.10683
  • [39] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robinson, D. Valter, S. Narang, G. Mishra, A. Yu, V. Zhao, Y. Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, and J. Wei, “Scaling Instruction-Finetuned Language Models,” 2022. [Online]. Available: https://arxiv.org/abs/2210.11416
  • [40] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension,” 2019. [Online]. Available: https://arxiv.org/abs/1910.13461
  • [41] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-Rank Adaptation of Large Language Models,” 2021. [Online]. Available: https://arxiv.org/abs/2106.09685
  • [42] OpenAI, “New and Improved Embedding Model,” 2023, publisher: OpenAI. [Online]. Available: https://openai.com/blog/new-and-improved-embedding-model
  • [43] T. Gao, X. Yao, and D. Chen, “SimCSE: Simple Contrastive Learning of Sentence Embeddings,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.   Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 6894–6910. [Online]. Available: https://aclanthology.org/2021.emnlp-main.552
  • [44] Y. Jiang, L. Zhang, and W. Wang, “Improved Universal Sentence Embeddings with Prompt-based Contrastive Learning and Energy-based Learning,” in Findings of the Association for Computational Linguistics: EMNLP 2022.   Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, Dec. 2022, pp. 3021–3035. [Online]. Available: https://aclanthology.org/2022.findings-emnlp.220
  • [45] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.   Association for Computational Linguistics, Nov. 2019. [Online]. Available: http://arxiv.org/abs/1908.10084
  • [46] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, and C. D. Manning, “Stanza: A Python Natural Language Processing Toolkit for Many Human Languages,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020.
  • [47] S. Bird, E. Loper, and E. Klein, Natural Language Processing with Python.   O’Reilly Media Inc., 2009.
  • [48] M. Pawlik and N. Augsten, “Efficient Computation of the Tree Edit Distance,” ACM Transactions on Database Systems, vol. 40, no. 1, pp. 1–40, Mar. 2015. [Online]. Available: https://dl.acm.org/doi/10.1145/2699485
  • [49] F. M. Zanzotto, A. Santilli, L. Ranaldi, D. Onorati, P. Tommasino, and F. Fallucchi, “KERMIT: Complementing Transformer Architectures with Encoders of Explicit Syntactic Interpretations,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).   Online: Association for Computational Linguistics, Nov. 2020, pp. 256–267. [Online]. Available: https://www.aclweb.org/anthology/2020.emnlp-main.18
  • [50] M. Post, “A Call for Clarity in Reporting BLEU Scores,” in Proceedings of the Third Conference on Machine Translation: Research Papers.   Belgium, Brussels: Association for Computational Linguistics, Oct. 2018, pp. 186–191. [Online]. Available: https://www.aclweb.org/anthology/W18-6319
  • [51] C. Van Der Lee, A. Gatt, E. Van Miltenburg, S. Wubben, and E. Krahmer, “Best practices for the human evaluation of automatically generated text,” in Proceedings of the 12th International Conference on Natural Language Generation.   Tokyo, Japan: Association for Computational Linguistics, 2019, pp. 355–368. [Online]. Available: https://www.aclweb.org/anthology/W19-8643
  • [52] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu, “G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment,” 2023, publisher: arXiv Version Number: 3. [Online]. Available: https://arxiv.org/abs/2303.16634
  • [53] OpenAI, “GPT-4 Technical Report,” 2023, publisher: arXiv Version Number: 3. [Online]. Available: https://arxiv.org/abs/2303.08774

Appendix A

Source Text: $source_text
Paraphrase: $paraphrase

Please evaluate the following aspects of the paraphrase in comparison to its source text on a likert scale of 1 to 5, where:

Semantic Similarity: This refers to how closely the meaning of the paraphrase matches the meaning of the source text.
Rating Scale for Semantic Similairty
1: The paraphrase has a completely different meaning or is unrelated to the source text.
2: The paraphrase has a somewhat different meaning from the source text
3: The paraphrase captures the general idea of the source text, but some details or nuances are missing.
4: The paraphrase largely captures the meaning of the source text but may have slight differences in wording or expression.
5: The paraphrase has an identical or nearly identical meaning to the source text.

Lexical Diversity: This aspect evaluates the range and richness of vocabulary used in the paraphrase, considering its comparison to the source text.
Rating Scale for Lexical Diversity
1: The paraphrase shows a limited use of words and lacks diversity when compared to the source text.
2: The paraphrase exhibits some variation in word choice but heavily relies on a few specific terms, which may not reflect the lexical diversity of the source text.
3: The paraphrase demonstrates moderate diversity in vocabulary, but there is room for improvement in terms of incorporating more varied word choices from the source text.
4: The paraphrase displays a good range of vocabulary, utilizing several different words and expressions that align with the lexical diversity of the source text.
5: The paraphrase showcases an extensive array of vocabulary, demonstrating excellent lexical diversity that closely matches or surpasses the richness of the source text.

Syntactic Diversity: This aspect assesses the structural variations in the paraphrase compared to the source text.
Rating Scale for Syntactic Diversity
1: The paraphrase closely mirrors the sentence structure of the source text with minimal variation.
2: The paraphrase shows some minor changes in sentence structure but largely follows the same pattern as the source text.
3: The paraphrase introduces moderate variations in sentence structure, deviating from the structure of the source text in certain aspects.
4: The paraphrase exhibits significant syntactic diversity, using different sentence structures while still conveying the same meaning as the source text.
5: The paraphrase displays a high level of syntactic diversity, employing various sentence structures creatively while maintaining the meaning of the source text.

Grammatical Correctness: This evaluates the grammatical accuracy of the paraphrase.
Rating Scale for Grammatical Correctness
1: The paraphrase contains numerous grammatical errors that significantly impact comprehension.
2: The paraphrase has several grammatical errors that occasionally affect understanding.
3: The paraphrase includes some grammatical errors, but they do not hinder overall comprehension.
4: The paraphrase demonstrates good grammatical correctness with only occasional minor errors.
5: The paraphrase is grammatically flawless, with no errors or inaccuracies.

Please provide your ratings for each aspect using the following json format:
{"Semantic Similarity": [Rating from 1 to 5], "Lexical Diversity": [Rating from 1 to 5], "Syntactic Diversity": [Rating from 1 to 5], "Grammatical Correctness": [Rating from 1 to 5]}

Figure A1: This figure illustrates the prompt fed to the gpt-4 model for evaluation.

Appendix B

Figure A2: Instructions given to Human Evaluators.