
IdEALS: Idiomatic Expressions for Advancement of Language Skills

Narutatsu Ri1, Bill Sun1, Sam Davidson2, Zhou Yu1
1Columbia University  2University of California, Davis
{wl2787, bys2017, zy2461}@columbia.edu
[email protected]
Abstract

Although significant progress has been made in developing methods for Grammatical Error Correction (GEC), addressing word choice improvements has received little attention; in particular, enhancing sentence expressivity by replacing phrases with more advanced expressions remains understudied. In this paper, we focus on this area and present our investigation into the task of incorporating idiomatic expressions into student writing. To facilitate our study, we curate extensive training sets and expert-annotated testing sets from real-world data, evaluate several approaches, and compare their performance against human experts.

1 Introduction

Grammatical Error Correction is a vital task that aims to improve student writing by addressing syntactic errors in written text, and various methods have been developed for this purpose (Chollampatt and Ng, 2018; Omelianchuk et al., 2020; Choshen et al., 2021, among others). However, there remains a gap in the field: improving learner writing by enhancing its naturalness and quality, that is, by replacing grammatically correct phrases with semantically equivalent but more advanced expressions. Existing tools have been found inadequate in providing such suggestions (Kham Thi and Nikolov, 2021).

To address this gap, we consider the task of providing replacement suggestions that incorporate idiomatic expressions (IEs) (Figure 1). Prior research proposes idiomatic sentence generation (ISG), the task of transforming sentences into alternative versions that include idioms (Zhou et al., 2021), and preserving semantic content has been extensively studied in paraphrase generation (McKeown, 1979; Wubben et al., 2010; Prakash et al., 2016, etc.). However, such methods are not designed specifically for writing improvement, and standard sequence-to-sequence approaches have proven ineffective for the ISG task. Additionally, current evaluation metrics are not directly applicable to writing improvement, emphasizing the need for methods tailored to this purpose.

In this work, we introduce Idiomatic Sentence Generation for Writing Improvement (ISG-WI), a variant of the ISG task focused on writing improvement. We extend existing ISG datasets for word choice recommendation by creating a training set that covers a broader range of idiomatic expressions, together with a testing set of real-world student-written sentences annotated by human experts; we refer to the resulting resource as the Idiomatic Expressions for Advancement of Language Skills (IdEALS) dataset. We provide precise definitions for performance metrics and explore two different approaches to the ISG-WI task.[1]

[1] Dataset and code can be found at https://github.com/narutatsuri/isg_wi.

We summarize the main contributions of this work as follows:

Figure 1: Examples of the ISG-WI task. Given a student-written sentence, the task seeks to enhance word choice by providing replacement suggestions for words or sentence constituents using idiomatic expressions.
  1. We compile a dataset consisting of a large-scale training set and an expert-annotated test set of real student-written essays.

  2. We propose precise metrics to evaluate performance on the ISG-WI task, and benchmark two approaches against expert human annotators.

2 Related Work

Grammatical Error Correction. Recent studies in Grammatical Error Correction (GEC) have focused on two main directions: generating synthetic data for pre-training Transformer-based models (Kiyono et al., 2019; Grundkiewicz et al., 2019; Zhou et al., 2020) and developing models specifically designed for correcting grammar errors (Chollampatt et al., 2016; Nadejde and Tetreault, 2019). In a related line of research, Zomer and Frankenberg-Garcia (2021) explore writing improvement models that address errors resulting from the student’s native language, going beyond surface-level grammatical corrections. Our work shares a similar objective in aiming to move beyond syntactic mistakes and instead promote advanced word choice usage.

Paraphrase Generation. Paraphrase models often draw inspiration from machine translation techniques (Wubben et al., 2012; Mallinson et al., 2017; Gupta et al., 2017; Prakash et al., 2016; Hegde and Patil, 2020; Niu et al., 2020), with some efforts devoted to domain-specific paraphrasing (Li et al., 2017; Xu et al., 2018). Various paraphrase datasets have been curated for training sequence-to-sequence paraphrase models (Dolan and Brockett, 2005; Ganitkevitch et al., 2013; Wang et al., 2017). Notably, Zhou et al. (2021) introduce the ISG task and propose a corresponding dataset. Building upon their work, we focus specifically on ISG for writing improvement by constructing a larger training dataset and an expert-annotated testing dataset from real-world student-written text, and we investigate approaches for the ISG-WI task.

3 The IdEALS Dataset

Here, we detail our procedure for curating the training and testing sets. A comparison with the PIE dataset (Zhou et al., 2021) is provided in Table 2.

Phrase Type      Phrases   Size
Idioms (Mono.)   184       1,722
Idioms (Poly.)   90        1,005
Phrasal Verbs    189       1,944
Prep. Phrases    505       4,156
Total            968       8,827

Changes   Count
0         454
1         326
2         28
3         2
Total     810

Table 1: Corpus statistics for the IdEALS dataset. The upper and lower tables correspond to the training and testing sets, respectively. "Idioms" refers to idiomatic expressions, and "Phrases" denotes the number of distinct phrases of each type.
Dataset   Train   Phrases   Test    Real Data
PIE       3,524   823       1,646   ✗
IdEALS    8,827   968       810     ✓
Table 2: Comparison of the PIE dataset (Zhou et al., 2021) and the IdEALS dataset. "Train" and "Test" refer to the train and test data sizes, respectively. "Real Data" indicates whether the test data consists of real-world student-written sentences.

3.1 Training Set

The training set of the IdEALS dataset comprises sentence pairs, each consisting of an original sentence and a corresponding paraphrased sentence in which a subpart of the original is replaced with an idiomatic expression where appropriate. Dataset statistics can be found in Table 1.

IE Collection. The training set includes a collection of idioms, prepositional phrases, and phrasal verbs as potential replacements. To curate idioms, we leverage the EPIE dataset (Saxena and Paul, 2020), which contains 358 static idiomatic expressions. After removing expressions unsuitable for written essays (e.g., "you bet"), we extract 274 idioms and gather synonyms for each idiomatic expression by scraping online sources. As no publicly available datasets exist for phrasal verbs and prepositional phrases, we obtain 1,000 phrasal verbs and 633 prepositional phrases from online educational sources.[2] We then filter out trivial phrases (e.g., "calm down"), leaving 189 phrasal verbs and 505 prepositional phrases, each accompanied by example sentence usages.

[2] Phrasal verbs: https://www.englishclub.com/store/product/1000-phrasal-verbs-in-context/; prepositional phrases: https://7esl.com/prepositional-phrase/

Sentence Pair Generation. For idioms, we utilize the example sentences provided in the EPIE dataset. We create parallel sentence pairs by replacing the idioms in these examples with synonyms, which then serve as the original sentences. As the number of example sentences for phrasal verbs and prepositional phrases is limited compared to idioms, we adopt in-context learning methods (Brown et al., 2020) and construct prompts for large language models to generate additional example sentences. Subsequently, we manually verify each sentence pair to ensure grammatical correctness and semantic consistency between the original and paraphrased sentences. Prompt details are provided in Appendix A.3.
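
For concreteness, the following is a minimal sketch of how this prompt-based generation might be implemented. The prompt template mirrors Appendix A.3; the client library, model choice, and decoding parameters are our assumptions, as the paper does not specify them.

```python
import openai  # legacy (pre-1.0) OpenAI client; the model choice below is an assumption

# Few-shot examples in the style of the Appendix A.3 prompt.
FEW_SHOT = [
    ("narrow down",
     ["I can't decide what to wear, so I'm going to narrow down my options to three dresses.",
      "After doing some research, I was able to narrow down my list of colleges to five."]),
]

HEADER = ("[TASK DESCRIPTION] Generate 5 example sentences that use the "
          "PHRASE provided without altering the semantic content of the "
          "original sentence.\n\n")

def build_generation_prompt(phrase: str) -> str:
    """Assemble the few-shot prompt for one target phrase."""
    parts = [HEADER]
    for i, (ex_phrase, sents) in enumerate(FEW_SHOT, start=1):
        parts.append(f"Example #{i}:\nPHRASE: {ex_phrase}\nSENTENCES:\n")
        parts.extend(f"- {s}\n" for s in sents)
        parts.append("\n")
    parts.append(f"PHRASE: {phrase}\nSENTENCES:\n")
    return "".join(parts)

response = openai.Completion.create(
    model="text-davinci-003",   # assumed; any capable LLM would do
    prompt=build_generation_prompt("spruce up"),
    max_tokens=256,
    temperature=0.7,
)
candidate_sentences = response["choices"][0]["text"].strip().splitlines()
```

Each generated sentence then passes through the manual verification step described above before it is paired with its synonym-substituted original.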

3.2 Testing Set

To assess the effectiveness of ISG-WI methods on student-written text, the testing data must consist of real-world student-written sentences, which may contain grammatical issues, so that the testing set accurately reflects the challenges and characteristics of student writing. In light of this, we construct the testing set by collecting real sentences written by students and having human experts annotate them.

Original Sentences. We gather 810 sentences from the ETS Corpus of Non-Native Written English (Blanchard et al., 2014), a curated database of essays written by students for the TOEFL exam. To ensure high-quality annotations, we enlist five graduate students pursuing linguistics degrees as annotators for this task.

Annotation Scheme. The annotators are instructed to preserve both the sentence structure and semantics while providing only idiomatic expressions as suggestions. For each sentence, annotators have the option to provide an alternate sentence in which specific subparts are replaced with an idiomatic expression. If a sentence is deemed not amenable to enhancement, annotators can choose not to provide any annotation. Given that multiple IEs can be equivalent replacements for the same phrase, and some sentences may contain multiple replaceable subparts, we encourage annotators to provide multiple annotations for the same sentence whenever possible. Annotation statistics are included in Table 1.

4 Methods

Previous studies highlight the limitations of vanilla sequence-to-sequence models in preserving semantic content and grammatical fluency for the ISG task (Zhou et al., 2021). In this paper, we explore two methods for the ISG-WI task and assess their performance on the IdEALS testing set.

Fine-tuning. We investigate the use of modern text-to-text models fine-tuned on the IdEALS dataset. Specifically, we employ the Parrot paraphraser model (Damodaran, 2021), which is based on the t5-base model from the T5 model family (Raffel et al., 2019). By fine-tuning the Parrot model on our training set, we enable the model to generate idiomatic sentence suggestions.
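
A minimal fine-tuning sketch with Hugging Face transformers is shown below. The 20-epoch budget follows Appendix A.1, while the Parrot checkpoint name, sequence length, and batch size are our assumptions.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

CKPT = "prithivida/parrot_paraphraser_on_T5"  # public Parrot checkpoint (assumed)
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSeq2SeqLM.from_pretrained(CKPT)

# IdEALS-style (original, idiomatic paraphrase) training pairs.
pairs = [{"source": "I was excited when I found out that I'd gotten a good grade.",
          "target": "I was as high as a kite when I found out that I'd gotten a good grade."}]

def tokenize(batch):
    """Tokenize source sentences and attach paraphrase targets as labels."""
    enc = tokenizer(batch["source"], truncation=True, max_length=128)
    enc["labels"] = tokenizer(batch["target"], truncation=True, max_length=128)["input_ids"]
    return enc

train_ds = Dataset.from_list(pairs).map(tokenize, batched=True,
                                        remove_columns=["source", "target"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="parrot-ideals",
                                  num_train_epochs=20,  # per Appendix A.1
                                  per_device_train_batch_size=16),
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```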

Figure 2: Model architecture for the fine-tuned model. Green blocks indicate the use of neural language models, while purple blocks indicate rule-based heuristics.

To ensure that the IdEALS task objectives are fulfilled, we employ postprocessing layers that correct grammar and guard against proper-noun hallucinations and incorrect replacements, and we evaluate whether the task objectives are met using the following criteria. First, we assess semantic consistency between input and output sentences using large language models trained on NLI datasets: a paraphrase is considered semantically consistent if it exhibits stronger entailment than neutrality. Second, we verify the presence of idiomatic expressions in the output against a comprehensive pool of idioms compiled from online sources. Finally, we assess the preservation of sentence structure by confirming that only one subpart of the original sentence is replaced in the paraphrase. Figure 2 illustrates the architecture of our models. Additional model specifics and training details are provided in Appendix A.1.
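
As an illustration, the semantic-consistency check might be sketched as follows. The specific checkpoint below is one public RoBERTa-large NLI model (also trained on FEVER-NLI) and its label ordering is specific to that checkpoint; both are assumptions, since the paper does not name the exact model it uses.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# One public RoBERTa-large model trained on SNLI/MNLI/ANLI (plus FEVER-NLI);
# the paper's exact checkpoint may differ.
NLI_CKPT = "ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli"
nli_tokenizer = AutoTokenizer.from_pretrained(NLI_CKPT)
nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_CKPT)

def semantically_consistent(original: str, paraphrase: str) -> bool:
    """Accept the paraphrase iff entailment outweighs neutrality."""
    inputs = nli_tokenizer(original, paraphrase, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli_model(**inputs).logits.softmax(dim=-1).squeeze(0)
    entailment, neutral = probs[0], probs[1]  # label order for this checkpoint
    return bool(entailment > neutral)
```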

To assess the efficacy of both the existing datasets and our model architecture, we train the backbone model using the PIE dataset Zhou et al. (2021) and the IdEALS dataset and compare the performances of these trained models.

Annotator                   No Edits   Good Edits   Bad Edits
                                                    Adeq.   Corr.   Rich.   Total
Parrot + PIE                475        141          83      15      102     194
Parrot + PIE + Wrapper      671        124          10      0       5       15
Parrot + IdEALS             347        189          138     11      186     274
Parrot + IdEALS + Wrapper   601        182          22      3       4       27
gpt-3.5-turbo               91         461          15      111     191     258
text-davinci-003            394        160          16      4       240     254
Human experts               453        311          4       0       32      36
Table 3: Performance of different models and human experts. "+" denotes dataset usage or the postprocessing wrapper module. Few-shot in-context learning is conducted with 10 in-context examples. "Adeq.," "Corr.," and "Rich." count bad edits that violate adequacy, correctness, and richness, respectively (a single bad edit can violate several criteria).

In-Context Learning. We also explore the application of in-context learning as an alternative approach. In-context learning involves keeping the parameters of a pre-trained language model fixed while providing it with a prompt containing task descriptions, examples, and a test instance. By leveraging the context, the model can learn patterns and make predictions based on the given task constraints. We construct the prompt by including task instructions and rules for the model to follow, followed by in-context examples sampled from the training set.
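
A minimal sketch of the prompt assembly is shown below; the rules and example format follow the Appendix A.4 prompt, while the function names and sampling seed are our assumptions.

```python
import random

RULES = (
    "[TASK DESCRIPTION] Enhance the input sentence by identifying phrases that "
    "can be replaced with a potentially idiomatic expression and output the "
    "replaced input sentence.\n"
    "Rules:\n"
    "1. Do not change the sentence besides replacing one phrase.\n"
    "2. Only replace phrases with idiomatic expressions.\n"
    "3. Do not alter the semantic content of the original sentence.\n"
    '4. If there are no phrases replaceable, return "nan".\n\n'
)

def build_icl_prompt(train_pairs, test_sentence, k=10, seed=0):
    """Sample k (original, paraphrase) training pairs as in-context examples."""
    rng = random.Random(seed)
    examples = rng.sample(train_pairs, k)
    body = "".join(
        f"Example #{i}:\nINPUT: {src}\nOUTPUT: {tgt}\n\n"
        for i, (src, tgt) in enumerate(examples, start=1)
    )
    return RULES + body + f"INPUT: {test_sentence}\nOUTPUT:"
```

The completed prompt is then sent to the fixed-parameter language model, which returns the paraphrased sentence (or "nan") as a completion.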

5 Results

5.1 Evaluation Metrics

In this section, we outline the scoring criteria used to evaluate the quality of annotations, whether generated by models or by human annotators, on the testing set. Drawing on existing paraphrase metrics (Liu et al., 2010; Patil et al., 2022; Shen et al., 2022), we establish three criteria to assess the quality of an annotation:

  • Adequacy: whether the semantic content of the original sentence is preserved in the output.

  • Correctness: whether the sentence structure is maintained, and no new grammatical mistakes are introduced in the output.

  • Richness: whether the provided suggestion incorporates an idiomatic expression.

Annotations that meet all three criteria are considered good annotations, while annotations that violate at least one criterion are classified as bad annotations, as sketched below.
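
To make the scoring concrete, the following sketch tallies judged annotations into the categories reported in Table 3. The data structure is ours; note that a single bad edit can violate several criteria, which is why the per-criterion columns in Table 3 need not sum to the bad-edit total.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Judgment:
    """Expert judgment of one suggested edit (structure is ours)."""
    adequacy: bool     # semantic content preserved
    correctness: bool  # structure kept, no new grammar errors
    richness: bool     # an idiomatic expression was used

def tally(judgments: List[Optional[Judgment]]) -> Counter:
    """Aggregate per-sentence judgments into Table 3-style counts."""
    counts = Counter()
    for j in judgments:
        if j is None:                        # no edit was proposed
            counts["no_edits"] += 1
        elif j.adequacy and j.correctness and j.richness:
            counts["good_edits"] += 1        # meets all three criteria
        else:
            counts["bad_total"] += 1         # violates at least one criterion
            counts["bad_adequacy"] += not j.adequacy
            counts["bad_correctness"] += not j.correctness
            counts["bad_richness"] += not j.richness
    return counts
```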

5.2 Performances

Here, we evaluate the methods described above against annotations from human experts and present the results in Table 3.

Fine-tuning. Comparing the backbone model trained on the PIE dataset and the IdEALS dataset, we observe a significant increase in the number of annotations with the IdEALS dataset. In both cases, there is a reduction in erroneous annotations when postprocessing wrappers are used, but occasionally at the expense of a slight decrease in the number of good annotations.

In-context Learning. The two models exhibit distinct performance profiles. Notably, gpt-3.5-turbo surpasses human experts in generating good annotations but struggles to maintain correctness. Conversely, text-davinci-003 falls short of models fine-tuned on the IdEALS dataset in producing good annotations and frequently violates the richness criterion.

Human Experts. We observe that experts exhibit a slight tendency to make richness mistakes but consistently deliver annotations of the highest quality compared to automated methods.

5.3 Error Analysis

Here, we analyze the prominent error cases for both methods. Error examples are included in Table 5 in the Appendix.

Fine-Tuning. The fine-tuned model exhibits two notable error cases: violating the adequacy condition due to semantic changes and suggesting trivial changes that lack idiomatic expressions. These errors are often observed in input sentences with syntactic errors or inappropriate phrase usage, highlighting the model’s vulnerability to mistakes when the input contains errors.

In-Context Learning. In-context learning methods tend to over-replace words with semantically equivalent phrases, resulting in insignificant changes compared to fine-tuned models. Addressing this issue proves challenging, even with additional rules in the prompt.

6 Conclusion

We introduced the ISG-WI task as a crucial step towards improving word choice in writing. We curated an extensive training set, along with a testing set of real-world student-written texts annotated by human experts. Our experimental results demonstrate the effectiveness of our dataset: language models trained on the IdEALS dataset generate better suggestions for idiomatic expressions than models trained on existing ISG datasets.

Ethical Considerations

Obtaining high-quality annotations for idiomatic sentences poses challenges as it necessitates expertise in language teaching and significant time commitment. We took measures to ensure that annotators were adequately compensated for their valuable contributions.

References

Set    Phrase Type      Example Sentence
Train  Idioms (Mono.)   s:  We’re not in sync. Listen carefully to what I am telling you.
                        s′: We’re not on the same page. Listen carefully to what I am telling you.
       Idioms (Poly.)   s:  I was excited when I found out that I’d gotten a good grade.
                        s′: I was as high as a kite when I found out that I’d gotten a good grade.
       Prep. Phrases    s:  I was nearly failing the test when I got the extra credit question.
                        s′: I was within an ace of failing the test when I got the extra credit question.
       Phrasal Verbs    s:  I’m going to renew my home before my in-laws come to visit.
                        s′: I’m going to spruce up my home before my in-laws come to visit.
Test   —                s:  We become lethargic studying something for a long time.
                        s′: We become heavy-eyed studying something for a long time.
       —                s:  So, people must rely on each other and must be active.
                        s′: So, people must have each others’ backs and must be active.
       —                s:  To sum up, enjoying life is not related to the person’s age.
                        s′: To sum up, making the most out of life is not related to the person’s age.
Table 4: Sentence pairs from the training and testing sets. s denotes the original sentence and s′ the paraphrased sentence. The training set covers four types of idiomatic expressions: idioms (monosemous and polysemous), phrasal verbs, and prepositional phrases. The original subphrase to be replaced is highlighted in red, and the replacement idiomatic expression is highlighted in green.
Method  Error Type           Example Sentence
FT      Incorrect Semantics  s:  To process a car is like a liberty of moving, you can do whatever you want in terms of transportation.
                             s′: To process a car is like a piece of cake of moving, you can do whatever you want in terms of transportation.
FT      Trivial Change       s:  They know that if they give a false picture of their product in an advertisement, they might fool customers for a short term.
                             s′: They know that if they give a false picture of their product in an advertisement, they might fool customers for a short time.
ICL     Trivial Change       s:  By this, I mean that we are all concentrated on our projects and we work so hard that we do not have time to think about ourselves.
                             s′: By this, I mean that we are all focused on our projects and we work so hard that we do not have time to think about ourselves.
Table 5: Example error cases of fine-tuning and in-context learning methods. "FT" refers to fine-tuned models and "ICL" refers to in-context learning methods. The original replaced subphrase is highlighted in red, and the erroneous replacement is highlighted in yellow.

Appendix A Supplementary Material

A.1 Model Details

The backbone model is trained for 20 epochs on the PIE dataset and the IdEALS dataset using two NVIDIA GeForce RTX 3090 graphics cards. For the postprocessing wrappers, we utilize the LanguageTool API[3] for the Grammar Fixer module. The Fact Fixer module utilizes the bert-base model trained on the CoNLL-2003 Named Entity Recognition dataset (Tjong Kim Sang and De Meulder, 2003), while the Semantic Checker module employs the roberta-large model trained on the Stanford NLI dataset (Bowman et al., 2015), the MultiNLI dataset (Williams et al., 2018), and the Adversarial NLI dataset (Nie et al., 2020). Both models are available at huggingface.co.

[3] https://languagetool.org/
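
As an illustration, the Grammar Fixer module could be wrapped as in the sketch below; the paper names the LanguageTool API, while the specific language_tool_python client and language variant are our assumptions.

```python
import language_tool_python

# Grammar Fixer sketch: route model output through LanguageTool before
# returning it to the user. The "en-US" variant is an assumption.
tool = language_tool_python.LanguageTool("en-US")

def fix_grammar(sentence: str) -> str:
    """Apply LanguageTool's suggested corrections to one sentence."""
    return tool.correct(sentence)
```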

A.2 Dataset Samples

Examples from both the training and testing sets can be found in Table 4. Each idiomatic expression type collected in the training set is represented by one example. Note that the test set does not have the idiomatic expression type labeled.

A.3 Prompt for Sentence Pair Generation

Here, we include the prompt utilized for generating additional training samples for phrasal verbs and prepositional phrases.

[TASK DESCRIPTION] Generate 5 example sentences that use the PHRASE provided without altering the semantic content of the original sentence.

Example #1:
PHRASE: wet behind the ears
SENTENCES:
- terry, it turned out, was just out of university and wet behind the ears.
- the song is all about how he felt as a small town, wet behind the ears kid coming to la for the first time.
- hawking was a research student, still wet behind the ears by scientific standards.

Example #2:
PHRASE: narrow down
SENTENCES:
- I can’t decide what to wear, so I’m going to narrow down my options to three dresses.
- We only have a limited amount of time, so we need to narrow down our options.
- After doing some research, I was able to narrow down my list of colleges to five.

Example #3:
PHRASE: bounce back
SENTENCES:
- After his divorce, he was able to bounce back and start dating again.
- The company’s sales took a hit after the recession, but they were able to bounce back and return to profitability.
- He was disappointed when his team lost the championship game, but he was able to bounce back and win the next one.

----------------------PROMPT ENDS HERE----------------------
PHRASE:
SENTENCES:

A.4 Prompt for In-Context Learning

The full prompt for few-shot settings is included, consisting of 10 examples sampled from our training set. An example completion by the text-davinci-003 model is provided at the end.

[TASK DESCRIPTION] Enhance the input sentence by identifying phrases that can be replaced with a potentially idiomatic expression and output the replaced input sentence.
Rules:
1. Do not change the sentence besides replacing one phrase.
2. Only replace phrases with idiomatic expressions.
3. Do not alter the semantic content of the original sentence.
4. If there are no phrases replaceable, return "nan".

Example #1:
INPUT: The city was besieged for weeks, and the people were running out of food and water.
OUTPUT: the city was under siege for weeks, and the people were running out of food and water.

Example #2:
INPUT: No matter what, we must ensure that our children have a bright future.
OUTPUT: at all costs, we must ensure that our children have a bright future.

Example #3:
INPUT: The train was travelling rapidly 200 kilometers per hour.
OUTPUT: the train was travelling at a speed of 200 kilometers per hour.

Example #4:
INPUT: Wow, man, this party is great!
OUTPUT: wow, man, this party is out of sight!

Example #5:
INPUT: In order to achieve success, she was willing to work long hours at the cost of her social life.
OUTPUT: in order to achieve success, she was willing to work long hours at the expense of her social life.

Example #6:
INPUT: We hope that by forming a bipartisan committee we will be able form a body that represents the most ideal circumstances.
OUTPUT: we hope that by forming a bipartisan committee we will be able form a body that represents the best of both worlds.

Example #7:
INPUT: The best way to eliminate the undesirables is to set high standards.
OUTPUT: The best way to weed out the undesirables is to set high standards.

Example #8:
INPUT: The top prize in the raffle is unclaimed.
OUTPUT: the top prize in the raffle is up for grabs.

Example #9:
INPUT: The news of the government’s corruption was the last straw, and people finally began to explode in protest.
OUTPUT: The news of the government’s corruption was the last straw, and people finally began to break out in protest.

Example #10:
INPUT: He refused to surrender even when faced with overwhelming odds.
OUTPUT: He refused to back down even when faced with overwhelming odds.

----------------------PROMPT ENDS HERE----------------------
INPUT: All people in a society can not be happy with the conditions or lifestyles that they are living in.
OUTPUT: All people in a society can not be on cloud nine with the conditions or lifestyles that they are living in.