Linguistics Theory Meets LLM: Code-Switched Text Generation via Equivalence Constrained Large Language Models

Garry Kuwanto¹, Chaitanya Agarwal², Genta Indra Winata^†3 , Derry Tanti Wijaya^†1,4
¹Boston University ²Deccan AI ³Capital One ⁴Monash University Indonesia
[email protected], [email protected]
[email protected], [email protected] The work was done outside the affiliation. ^†The authors are senior authors.

Abstract

Code-switching, the phenomenon of alternating between two or more languages in a single conversation, presents unique challenges for Natural Language Processing (NLP). Most existing research focuses on either syntactic constraints or neural generation, with few efforts to integrate linguistic theory with large language models (LLMs) for generating natural code-switched text. In this paper, we introduce EZSwitch, a novel framework that combines Equivalence Constraint Theory (ECT) with LLMs to produce linguistically valid and fluent code-switched text. We evaluate our method using both human judgments and automatic metrics, demonstrating a significant improvement in the quality of generated code-switching sentences compared to baseline LLMs. To address the lack of suitable evaluation metrics, we conduct a comprehensive correlation study of various automatic metrics against human scores, revealing that current metrics often fail to capture the nuanced fluency of code-switched text. Additionally, we create CSPref, a human preference dataset based on human ratings and analyze model performance across “hard” and “easy” examples. Our findings indicate that incorporating linguistic constraints into LLMs leads to more robust and human-aligned generation, paving the way for scalable code-switching text generation across diverse language pairs.¹¹1The code for EZSwitch is released at https://github.com/gkuwanto/ezswitch and CSPref is released at https://huggingface.co/datasets/garrykuwanto/cspref.

1 Introduction

Refer to caption — Figure 1: Example of EZSwitch. The top panel shows the word-level alignment between English and Hindi. The middle panel displays the input to the LLM, including the original English sentence and word constraints derived from ECT. The bottom panel shows the resulting code-switched output sentence generated by the LLM.

Code-switching, the practice of alternating between languages within a sentence or discourse, is prevalent in multilingual communities Myers-Scotton (1997). This phenomenon poses significant and long-standing challenges for natural language processing (NLP) tasks, such as natural language understanding, text generation, and speech processing. Large Language Models (LLMs) often struggle to effectively handle the complexity and variability inherent in code-switched language, resulting in suboptimal performance in these areas Winata et al. (2021); Zhang et al. (2023). Linguistic studies have shown that code-switching does not occur randomly but follows specific patterns or grammatical rules Poplack (1980); Myers-Scotton (1994). Leveraging this linguistic information helps predicting where code-switching naturally occurs. However, many of these linguistic constraints require complex grammatical structures as inputs, such as those from external word aligners or constituency parsers that can be difficult for models to process. In addition, constituency parsers are not always available and may introduce errors on distant languages Winata et al. (2019).

To address the complexity of learning code-switching patterns in NLP, existing approaches often treat code-switched language as a distinct language, developing new neural machine translation (NMT) models trained on non-code-mixed parallel data to generate code-switched data Winata et al. (2019); Gupta et al. (2021). These models learn in a supervised manner, eliminating the need for grammar parsers that might introduce errors. While these methods have shown promising results, they come with substantial computational costs and resource demands. Training a new NMT model from scratch to generate code-switched data is both expensive and time-consuming, making it impractical for resource-constrained environments or low-resource language pairs. Additionally, the robustness of NMT models can be compromised given the scarcity of the required parallel to code-switched training data.

One of the critical challenges in generating code-switched text is identifying valid switching points that adhere to the syntactic and semantic rules of the languages involved. The equivalence constraint theory (ECT) Poplack (1980) suggests that code-switching is permissible only at points where the grammatical structures of the languages align. However, applying this theory in practice is complex, particularly when the goal is to generate text that is both fluent and contextually appropriate. To bridge the gap, we propose to leverage linguistics constraints and incorporate them seamlessly on LLMs. Our contributions can be summarized in three-fold:

1.

We propose EZSwitch, a code-switching generation approach that effectively integrates ECT into LLM, generating code-switched text that is both syntactically and semantically valid. An example of EZSwitch is illustrated in Figure 1.
2.

We conduct a comprehensive high-quality evaluation of the generated samples via human annotation with native speakers. Given the lack of suitable automatic metrics for code-switching fluency, we employ human evaluations to assess fluency and accuracy, and analyze correlations with established metric.
3.

We release CSPref, a pairwise preference dataset designed to assess the fluency and accuracy of code-switched generated text. This dataset holds potential value for future research in preference tuning.

We release our code-switched data with human preference annotations to facilitate code-switching future research and evaluation.

2 Methodology

In this section, we provide an overview of linguistic theory and description of EZSwitch.

2.1 Equivalence Constraint Theory (ECT)

ECT posits that code-switching occurs only when it does not violate the syntactic rules of either language Poplack (1980). Winata et al. (2019) applies ECT by simplifying sentences in terms of a linear grammatical structure and allowing lexical substitution on non-crossing alignments between parallel sentences (e.g., lexical substitution between ”sentence” and ”vaaky” in Figure 1). Denoting $L_{1}$ as the source language and $L_{2}$ as the target language, given a sentence in $L_{1}$ comprising an array of words $u_{t}={a_{1},a_{2},...,a_{m}}$ and a corresponding sentence in $L_{2}$ comprising an array of words $v_{t}={b_{1},b_{2},...,b_{m}}$ , the alignment between $a_{i}$ and $b_{i}$ does not satisfy the constraint if there exists a pair $a_{j}$ and $b_{j}$ such that ( $a_{i}<a_{j}$ and $b_{i}>b_{j}$ ) or ( $a_{i}>a_{j}$ and $b_{i}<b_{j}$ ). If a switch occurs at this point, it alters the grammatical order in both languages, rendering the switch unacceptable. During the generation step, we permit any switches that do not violate this constraint.

2.2 Relaxed ECT

While adhering to the core principles of Equivalence Theory Poplack (1980), our approach applies and builds on the relaxed version of ECT introduced by Winata et al. (2019) (Section 2.1) to the task of code-switched sentence generation. This relaxation allows for greater flexibility in identifying potential switching points, accommodating the complexities of real-world code-switching patterns while maintaining grammatical coherence. Our implementation expands on the linear grammatical structure and non-crossing alignment criteria outlined in Section 2.1, introducing additional flexibility to capture a broader range of code-switching phenomena. In the following section, we outline our approach to implementing the relaxed ECT for identifying switching points, developing code-switched sentence generation techniques, and establishing a comprehensive evaluation framework.

2.3 EZSwitch

Building on the theoretical foundation established earlier, our implementation of ECT to generate code-switched sentence comprises of three main stages: 1) obtaining translations, 2) bitext alignment, and 3) code-switched sentence generation.

2.3.1 Obtaining Translations

We employ two distinct methods for obtaining translations:

Human Translations.

We utilize existing parallel datasets that contain human-generated translations. Specifically, for Hindi we use the HinGE dataset (Srivastava and Singh, 2021), and for Tamil and Malayalam we use the Samanantar (Ramesh et al., 2022) dataset. These datasets contain human-translated parallel sentences, providing a reliable gold standard for translations.

LLM Translations.

For LLM translations, we utilize Llama3 8B, a compact and open-source large language model. The translation process involves directly prompting the model. This approach allows us to generate translations using a consistent method across all input sentences and provides the silver standard for translations.

2.3.2 Bitext Alignment

After obtaining translations, we use the GIZA++ tool Och and Ney (2003) to align the sentences. We then apply the resulting alignments alongside the Equivalent Constraint Theory, as described in Section 2.1, to generate a list of potential switching points.

2.3.3 Code-Switched Sentence Generation

Once we identify switching points using the ECT approach, we guide language models to generate code-switched sentences by inputting the original monolingual sentence into the model. To construct the prompt effectively, we incorporate the following information:

Bilingual Context Setting.

We instruct the language model to act as a bilingual speaker of the two languages involved (e.g., Hindi and English).

Exemplar Provision.

We provide an example of a code-switched sentence along with its monolingual counterpart to demonstrate the desired style of output.

Constraint Application.

We specify key words from both languages, derived from the valid switching points (Section 2.1) that should appear in the code-switched output.

Generation.

The model generates a code-switched sentence based on the given context, example, constraints, and input sentence. We employ three different prompting methods in our experiments, each applied bidirectionally: A $\rightarrow$ CS (A and B), and B $\rightarrow$ CS (B and A), where the former is from language A to a code-switched text of A and B, and the latter is from language B to a code-switched text of B and A.

EZSwitch.

We use switching points identified with ECT using the silver translations generated by Llama3 8B. The model is given specific words from these LLM-generated translations to include in the code-switched output. This bidirectional approach across the three methods allows us to comprehensively explore code-switching patterns initiated from both languages, providing a robust comparison of the different prompting strategies.

3 Experimental Setup

3.1 Models

We evaluate our approach using three different LLMs: Aya23 Aryabumi et al. (2024), Llama3 8B and Llama3.1 8B Dubey et al. (2024). Aryabumi et al. (2024) introduce Aya 23 8B, an open-weight large language model specifically designed for multilingual tasks. It is trained on a diverse corpus of languages, making it particularly suitable for code-switching applications. Dubey et al. (2024) develop Llama3 8B, a compact yet powerful language model from the Llama3 series. With 8 billion parameters, it offers a balance between computational efficiency and performance, making it ideal for exploring code-switching in resource-constrained settings. Llama3.1 8B is an improved version of Llama3 8B. This model features refined training techniques and an updated dataset, incorporating the latest advancements in language modeling and potentially offering enhanced performance in multilingual and code-switching tasks. All experiments are run on a single NVIDIA L40 GPU with 48GB of memory.

3.2 Translation Directions

We consider two directions of code-switched generation in our experiments: English to Code-Switched and Indic Language (Hindi, Tamil, Malayalam) to Code-Switched. These directions apply to each language pair in our study, allowing us to examine code-switching patterns initiated from both languages.

3.3 Baselines

We do not use ECT-identified switching points. Instead, we simply instruct the model to create a code-switched version of the input sentence without specifying particular words to include. As another comparison, we We use switching points identified with ECT using the gold (human) translations from the parallel datasets (HinGE for Hindi, Samanantar for Tamil and Malayalam), we called it Human ECT. The model is given specific words from these human translations to include in the code-switched output.

3.4 Datasets

Our experiments utilize three parallel corpora, each corresponding to a distinct language pair, as detailed in Table 1. For the human translations employed in our Human ECT method, we rely on the parallel sentences available in these datasets. In contrast, for the LLM translations used in EZSwitch, we generate translations using the Llama3 8B model.

Language Pair	Source	Size
Hindi-English (hi-en)	HinGE	2,766
Tamil-English (ta-en)	Samanantar (WAT 2020)	2,000
Malayalam-English (ml-en)	Samanantar (WAT 2020)	2,000

Table 1: Summary of datasets used in the experiments, including source and size.

3.5 Evaluation

In our evaluation, we assess the effectiveness of various automatic metrics in capturing human judgment scores of fluency and accuracy of the code-switched text. Specifically, we experiment with COMET (Rei et al., 2020), permuting the reference language to gauge sensitivity across different reference settings. Additionally, we employ GPT-4o-mini as an evaluator to explore leveraging large language models (LLMs) for assessing code-switching, given the subjectivity inherent in human evaluations. This approach allows us to compare traditional automatic metrics with LLM-based evaluation in terms of alignment with human preferences.

COMET.

We compute COMET scores by using English (L1) and the Indic language (L2) as the reference language in separate runs, referred to as COMET_l1 and COMET_l2. To obtain a more comprehensive evaluation, we also take the average of these two scores (COMET_avg). This permutation helps in evaluating how robust COMET is across different reference configurations, capturing variability in code-switched translations. Among the metrics, comet_avg demonstrates the highest correlation with human evaluations, making it a more reliable option for code-switching assessment.

GPT4 Evaluation.

In addition to classical metrics, we conduct an evaluation using GPT4 as a proxy for human judgment. We provide GPT4 (specifically, gpt-4o-mini) with the exact same set of instructions as our human evaluators, asking it to rate the generated sentences for both accuracy and fluency. This setup allows us to directly compare the correlation between GPT4 ratings and human annotations, providing insight into the potential of large language models as evaluators.

3.6 Human Evaluation

We perform human evaluations of the code-switched sentences with native speakers who are bilingual and use code-switching in their daily communications. We adopt a setup inspired by Kuwanto et al. (2024) and reformulate the evaluation to focus on rating sentences rather than expressing a preference. Human evaluators, contractually employed by DeccanAI (previously, SoulAI) are asked to score the accuracy and fluency of the code-switched sentences. DeccanAI ethically recruits human evaluators in India, and evaluates them to determine their English and native language proficiency via their crowdsourcing platform. They also onboard evaluators, train them with the annotation guidelines, and obtain the annotation scores through their annotation platform.

This on 150 sample of inputs from dataset described in section 3.4 with 18 different generation settings. Totaling 2700 sentence to rate for each language. In total, we conduct 24,300 human evaluations in total. For each code-switched sentence, we ask 3 unique evaluators to score the accuracy and fluency of the sentence on a discrete scale from 1 (lowest) to 3 (highest). While evaluating, the evaluators can see (1) the English sentence, (2) the sentence in the respective native language, and (3) the LLM generated code-switched sentence.

4 Results

4.1 Automatic Metrics

We evaluate the entire dataset described in Table 1, each of the input pairs yielded 18 different generations from the different method and model. We use GPT-4o-mini, as traditional metrics often fail to align with human judgment for code-switched text. GPT-4o-mini was used to assess both Accuracy (GPT4o_a) and Fluency (GPT4o_f) for all model and method combinations. The results are split into translations from English input and Indic input to allow for a detailed comparison of each method’s performance across different language directions. Word Replacement (WR) refers to the method proposed in Winata et al. (2019), which we use for comparison against our proposed method.

From English Input
Method	Model	GPT4o_a	GPT4o_f
Baseline	WR	1.362	1.374
	Aya23	1.469	1.486
	Llama3	1.375	1.390
	Llama3.1	1.430	1.452
Human ECT	Aya23	1.446	1.449
	Llama3	1.354	1.364
	Llama3.1	1.498	1.512
EZSwitch	Aya23	1.377	1.472
	Llama3	1.336	1.345
	Llama3.1	1.477	1.485
From Indic Input
Baseline	WR	1.410	1.412
	Aya23	1.486	1.848
	Llama3	1.732	1.697
	Llama3.1	1.834	1.823
Human ECT	Aya23	1.525	1.775
	Llama3	1.451	1.445
	Llama3.1	1.850	1.852
EZSwitch	Aya23	1.528	1.788
	Llama3	1.460	1.450
	Llama3.1	1.843	1.849

Table 2: GPT-4 evaluation of Accuracy (GPT4o_a) and Fluency (GPT4o_f) for English-to-Indic and Indic-to-English translations across different models and methods. WR refers to the word replacement method from Winata et al. (2019). Best scores for each column within a language group are highlighted in bold.

4.2 Human Evaluation Results

We performed human evaluations on a sample of the entire dataset, using a random sample of 150 input sentences per Indic language. Given that we tested 3 methods, 3 models, and 2 translation directions (i.e., English $\leftrightarrow$ Indic), this resulted in a total of 2,700 ratings per Indic language. Table 3 shows the result of each of three primary methods: Baseline, Human ECT, and EZSwitch, evaluated across three models: Aya23, Llama3, and Llama3.1. We use human evaluation of Accuracy and Fluency ratings as our primary evaluation criteria, alongside automatic metrics such as COMET and GPT4-based scores (denoted as GPT4o_a for Accuracy and GPT4o_f for Fluency). The table is divided into two main sections based on the direction of translation: From English Input and From Indic Input, allowing us to compare the performance across different language directions. Each method and model combination is evaluated to identify which setup achieves the highest fluency and accuracy as judged by human evaluators. Additionally, COMET is included as a traditional automatic metric for comparison, while GPT4-based evaluations provide an alternative automated approach to assessing fluency and accuracy using large language models.

From English Input
Method	Model	Human Accuracy	Human Fluency	COMET	GPT4o_a	GPT4o_f
Baseline	Aya23	1.394	1.375	0.512	1.482	1.502
	Llama3	1.314	1.313	0.442	1.384	1.418
	Llama3.1	1.404	1.400	0.481	1.496	1.529
Human ECT	Aya23	1.332	1.377	0.483	1.458	1.584
	Llama3	1.306	1.301	0.463	1.396	1.456
	Llama3.1	1.421	1.400	0.499	1.527	1.573
EZSwitch	Aya23	1.313	1.369	0.473	1.418	1.522
	Llama3	1.327	1.285	0.451	1.396	1.458
	Llama3.1	1.424	1.417	0.491	1.504	1.556
From Indic Input
Baseline	Aya23	1.470	1.567	0.555	1.389	1.680
	Llama3	2.231	2.141	0.415	1.562	1.587
	Llama3.1	2.004	1.926	0.436	1.731	1.738
Human ECT	Aya23	1.583	1.613	0.567	1.433	1.656
	Llama3	1.801	1.711	0.428	1.393	1.442
	Llama3.1	1.888	1.837	0.513	1.702	1.782
EZSwitch	Aya23	1.538	1.605	0.566	1.438	1.644
	Llama3	1.756	1.691	0.435	1.384	1.413
	Llama3.1	1.904	1.833	0.510	1.713	1.791

Table 3: Mean scores of Human Accuracy, Human Fluency, COMET (comet_avg), and GPT4-based evaluations (GPT4o_a for Accuracy and GPT4o_f for Fluency) for Baseline, Human ECT, and EZSwitch methods across three models (Aya23, Llama3, and Llama3.1) with English and Indic as source languages. Scores are grouped by translation direction (English-Indic and Indic-English) to highlight the performance differences based on input language. Best scores for each column within a language group are highlighted in bold.

4.3 Correlation Between Automatic and Human Metrics

This subsection explores the relationship between various automatic evaluation metrics and human judgments for both Accuracy and Fluency in code-switched text generation. Understanding these correlations is crucial for determining the reliability of automatic metrics as proxies for human evaluation. Table 4 presents the Kendall’s Tau correlation scores between human ratings and several automatic metrics, including BLEU, COMET, and BERTScore, along with GPT4-based evaluations. Higher correlation values indicate a stronger alignment with human ratings, suggesting the suitability of a metric for capturing fluency and syntactic coherence in code-switched outputs. This experiment was done on the same 150 subset of input pairs from Section 4.2, which yielded 2,700 samples for each language, and we measure the correlations of these metrics with the average human ratings.

4.4 Pairwise Preference Dataset

We construct CSPref, a pairwise preference dataset using human ratings to evaluate the performance of different models in code-switched text generation. Each pair consists of two generated code-switched sentences compared based on their human-evaluated accuracy and fluency scores. To further analyze the performance, we split the dataset into easy and hard subsets. The easy subset includes pairs where the difference in human ratings is high (indicating a clear preference for one generated sentence), while the hard subset consists of pairs with minimal differences (indicating ambiguous preferences). Table 5 provides the statistics for the pairwise dataset across three languages: Hindi, Tamil, and Malayalam. We report the total number of pairs, as well as the breakdown into “easy” and “hard” subsets for each language pair.

	Accuracy	Fluency
Human Accuracy	1.000	0.768
Human Fluency	0.768	1.000
GPT4o_a	0.558	0.504
GPT4o_f	0.540	0.514
COMET_avg	0.246	0.290
COMET_l1	0.237	0.285
BLEU^∗	0.229	0.201
COMET_l2	0.175	0.202
BERTScore_l2_f1	0.121	0.204
BERTScore_l2_recall	0.119	0.200
BERTScore_l2_precision	0.117	0.199
BERTScore_l1_recall	0.073	0.126
BERTScore_l1_f1	0.071	0.128
BERTScore_l1_precision	0.064	0.120

Table 4: Kendall’s tau correlation scores between different automatic metrics and human evaluations for Accuracy and Fluency. ^∗BLEU score can only be calculated for Hindi-English as there is no code-switched references for other language pairs.

Language Pair	Total	Easy	Hard
Hindi-English (hi-en)	17,460	9,621	7,839
Tamil-English (ta-en)	5,034	4,506	528
Malayalam-English (ml-en)	8,664	7,517	1,147

Table 5: Statistics of CSPref for three languages (Hindi, Tamil, and Malayalam). “Easy” pairs are defined as those with high rating differences, while “Hard” pairs are defined as those with low rating differences.

5 Discussion

5.1 Model Performance

Our experimental results, as presented in Table 2, demonstrate that even small, open-source models like Llama3 8B can produce high-quality code-switched sentences when guided by linguistic constraints. The ECT-guided generation outperformed the baseline in both fluency and grammatical accuracy, illustrating that linguistic theory can effectively enhance the outputs of language models. Notably, Llama3.1 8B consistently achieved higher accuracy and fluency scores compared to other models, highlighting the importance of the language model’s architecture and parameterization in code-switching generation.

5.2 Method Effectiveness

The three generation methods (Baseline, Human ECT, and EZSwitch) offer valuable insights based on human evaluation. EZSwitch outperforms Human ECT for translations from English input, demonstrating better fluency and syntactic coherence. This suggests that EZSwitch excels at handling English-to-Indic code-switching. For from Indic input, human evaluations show that EZSwitch and Human ECT perform similarly, but both underperform compared to the Baseline. This may be due to the abundance of data where Indic languages act as the matrix language and English as the embedded language, making LLMs more adept at handling this type of code-switching, as they already possess knowledge of such patterns. When assessed using automatic metrics, Human ECT performed best across all measures, highlighting its ability to capture linguistic nuances that LLM-generated translations might miss. However, EZSwitch still showed significant improvement over the Baseline, demonstrating that ECT-guided approaches can be applied effectively in low-resource scenarios. These automatic metric results underscore the need for metrics that better reflect the complexities of code-switching.

5.3 Directional Asymmetry

An unexpected finding was the directional asymmetry in model performance between different translation directions. Specifically, translations from Indic languages to English exhibited significantly higher fluency and accuracy than translations in the reverse direction. This asymmetry may be attributed to the training data distribution, where code-switching from Indic languages to English is more prevalent in real-world usage, making the model more familiar with this direction. Additionally, all annotators being native speakers of Indic languages and L2 learners of English might introduce bias. Their proficiency in their native language allows for more rigorous scrutiny of translations into Indic languages, potentially leading to stricter evaluations of fluency and accuracy. Conversely, translations from Indic languages to English might be rated more leniently, as English is a second language for the annotators, making them less critical of subtle errors. This directional asymmetry could also reflect real-world code-switching patterns that favor English as the matrix language, with Indic languages filling lexical gaps. Future research could explore how this asymmetry relates to linguistic theories, such as the Matrix Language Frame model, and examine the impact of annotators’ bilingual proficiency on these evaluations.

5.4 Evaluation Challenges and Human Preference Correlation

Evaluating code-switching fluency and accuracy is challenging due to the limitations of existing automatic metrics. Based on Table 4, while SentenceBLEU reduces reliance on human-generated references, it falls short in measuring fluency. Embedding-based metrics like BERTScore and COMET, although better at capturing semantic similarity, exhibit weak correlations with human evaluations, around 0.2. In contrast, GPT-4o-mini, when used as an evaluator with structured guidance, achieves a higher correlation (Kendall’s tau of 0.5). This indicates that LLMs can effectively approximate human judgment, albeit with potential stylistic biases. These findings underscore the need for specialized code-switching metrics that integrate syntax, semantics, and context, emphasizing naturalness and linguistic validity over traditional n-gram-based matching. Developing tailored metrics is essential for accurately assessing multilingual text generation, as conventional tools fail to capture the complexities inherent in code-switching.

6 Related Work

Code-switching has been extensively studied from both linguistic and computational perspectives. Early linguistic theories, such as ECT (Poplack, 1980), establishes foundational principles for understanding syntactic boundaries in code-switching. Similarly, research by Joshi (1982) and Pfaff (1979) examine structural constraints and sentence processing in bilingual contexts. Recent computational approaches have adapted these theories into neural models. For instance, Winata et al. (2019) utilized ECT to generate synthetic data for training language models, while Gupta et al. (2020) employed pre-trained models to create code-switched text without explicit constraints. Pratapa and Choudhury (2021) utilized ECT to synthetically generate code-switched text by using the Dependency Tree. And Gupta et al. (2021) adopted a Machine Translation approach to the problem. Comprehensive survey by Sitaram et al. (2019); Winata et al. (2023) outline the computational challenges and advancements in code-switching research.

Evaluation benchmarks, such as LinCE (Aguilar et al., 2020) and GLUECoS (Khanuja et al., 2020) have standardized model assessments across diverse tasks. Recent studies have also investigated automatic metrics for code-switching (Guzmán et al., 2017) and explored the use of LLMs in understanding code-switched text (De Leon et al., 2024), and also generating (Yong et al., 2023). In this context, our work builds on these foundations by integrating linguistic constraints into LLM-based generation, addressing existing limitations in fluency and accuracy evaluation.

7 Future Work

In our preliminary analysis of code-switching metrics, using methods like those proposed by Guzmán et al. (2017), we find significant variability in how annotators judged sentences, with different distributions of metrics across individuals. This variability is further highlighted by the I-index distribution for all annotators, as shown in Figure 2. These findings suggest that individual preferences and linguistic backgrounds play a critical role in the evaluation of code-switching fluency. Future work should delve into the demographic data of annotators, exploring how factors like age, region, and language proficiency influence code-switching preferences. A more detailed study on how these demographic factors affect evaluation consistency could improve the robustness of human-centered code-switching metrics. In addition to exploring demographic influences, another promising direction is training LLMs to align with human preferences, conditioned on demographic data. By using our human evaluation dataset, we could develop personalized code-switching models that adapt to individual linguistic preferences and switching habits. Conditioning on demographics like age, region, and language proficiency, this approach could create a more tailored and context-aware code-switching LLM, ultimately improving the fluency and relevance of code-switched text generation for diverse users. It is also worthwhile to explore the dialectical variations in code-switching languages that exhibit different language styles Aji et al. (2022); Winata et al. (2024b). Developing a new metric aligned with human preferences for natural-sounding code-switching text would greatly benefit future code-switching research Winata et al. (2024a, c).

8 Conclusion

We propose EZSwitch, a method for generating high-quality code-switched text by integrating Equivalence Constraint Theory (ECT) with large language models (LLMs). Our approach addresses a significant gap in code-switching generation by combining linguistic theoretical constraints with modern language models. Through a series of experiments using both human evaluations and automatic metrics, we demonstrate that our method produces more fluent and accurate code-switching sentences compared to baseline LLMs and previous works. Furthermore, we constructed a Human Preference dataset to capture human preferences to help future work with aligning to codeswitching. Our correlation analysis of automatic metrics, such as COMET, reveals that existing metrics are insufficient for evaluating code-switching text, and alternative approaches such as GPT-based evaluation achieve better alignment with human judgments.

Limitations

Our study is intentionally limited in scope, focusing exclusively on the evaluation of open-source language models to ensure reproducibility and transparency in our findings. By concentrating on these models, we aim to establish a solid foundation for future research that can be easily replicated by other scholars in the field. Looking ahead, we plan to extend our methodology to include commercial language models, which may provide enhanced capabilities and performance. This expansion will enable us to explore the potential benefits and challenges of integrating proprietary models into our framework. Additionally, we are considering the inclusion of more languages in future studies to further enrich our research and broaden its applicability across diverse linguistic contexts.

Ethics Statement

All aspects of this research were reviewed and approved by the Institutional Review Board of our organization. Data collection was conducted by DeccanAI. We compensate human evaluators INR 110 for every 18 sentences they evaluate, which typically takes around 20 minutes. This results in an effective pay rate of INR 330 per hour. The human evaluators work entirely remotely and interact with DeccanAI through their web platform. All evaluators are native speakers of the respective Indic languages they assess and are proficient in English. Their language proficiency is evaluated through custom online tests. Most evaluators come from major cities in India where these native languages are spoken and frequently engage in code-switched dialogues. DeccanAI provides training for the evaluators to ensure they are well-calibrated with the annotation guidelines.

References

Aguilar et al. (2020) Gustavo Aguilar, Sudipta Kar, and Thamar Solorio. 2020. Lince: A centralized benchmark for linguistic code-switching evaluation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1803–1813.
Aji et al. (2022) Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya, Ade Romadhony, Rahmad Mahendra, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Timothy Baldwin, et al. 2022. One country, 700+ languages: Nlp challenges for underrepresented languages and dialects in indonesia. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7226–7249.
Aryabumi et al. (2024) Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Kelly Marchisio, Sebastian Ruder, et al. 2024. Aya 23: Open weight releases to further multilingual progress. arXiv preprint arXiv:2405.15032.
De Leon et al. (2024) Frances Adriana Laureano De Leon, Harish Tayyar Madabushi, and Mark Lee. 2024. Code-mixed probes show how pre-trained models generalise on code-switched text. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 3457–3468.
Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
Gupta et al. (2021) Abhirut Gupta, Aditya Vavre, and Sunita Sarawagi. 2021. Training data augmentation for code-mixed translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5760–5766.
Gupta et al. (2020) Deepak Gupta, Asif Ekbal, and Pushpak Bhattacharyya. 2020. A semi-supervised approach to generate the code-mixed text using pre-trained encoder and transfer learning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2267–2280.
Guzmán et al. (2017) Gualberto A Guzmán, Joseph Ricard, Jacqueline Serigos, Barbara E Bullock, and Almeida Jacqueline Toribio. 2017. Metrics for modeling code-switching across corpora. In Interspeech, pages 67–71.
Joshi (1982) Aravind Joshi. 1982. Processing of sentences with intra-sentential code-switching. In Coling 1982: Proceedings of the Ninth International Conference on Computational Linguistics.
Khanuja et al. (2020) Simran Khanuja, Sandipan Dandapat, Anirudh Srinivasan, Sunayana Sitaram, and Monojit Choudhury. 2020. Gluecos: An evaluation benchmark for code-switched nlp. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3575–3585.
Kuwanto et al. (2024) Garry Kuwanto, Eno-Abasi E Urua, Priscilla Amondi Amuok, Shamsuddeen Hassan Muhammad, Anuoluwapo Aremu, Verrah Otiende, Loice Emma Nanyanga, Teresiah W Nyoike, Aniefon D Akpan, Nsima Ab Udouboh, et al. 2024. Mitigating translationese in low-resource languages: The storyboard approach. arXiv preprint arXiv:2407.10152.
Myers-Scotton (1994) Carol Myers-Scotton. 1994. Social motivations for codeswitching. evidence from africa. Multilingua-Journal of Interlanguage Communication, 13(4):387–424.
Myers-Scotton (1997) Carol Myers-Scotton. 1997. Duelling languages: Grammatical structure in codeswitching. Oxford University Press.
Och and Ney (2003) Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational linguistics, 29(1):19–51.
Pfaff (1979) Carol W Pfaff. 1979. Constraints on language mixing: Intrasentential code-switching and borrowing in spanish/english. Language, pages 291–318.
Poplack (1980) Shana Poplack. 1980. Sometimes i’ll start a sentence in spanish y termino en espanol: toward a typology of code-switching1. Linguistics, 18(7-8):581–618.
Pratapa and Choudhury (2021) Adithya Pratapa and Monojit Choudhury. 2021. Comparing grammatical theories of code-mixing. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 158–167.
Ramesh et al. (2022) Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan Ak, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Divyanshu Kakwani, Navneet Kumar, et al. 2022. Samanantar: The largest publicly available parallel corpora collection for 11 indic languages. Transactions of the Association for Computational Linguistics, 10:145–162.
Rei et al. (2020) Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. Comet: A neural framework for mt evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702.
Sitaram et al. (2019) Sunayana Sitaram, Khyathi Raghavi Chandu, Sai Krishna Rallabandi, and Alan W Black. 2019. A survey of code-switched speech and language processing. arXiv preprint arXiv:1904.00784.
Srivastava and Singh (2021) Vivek Srivastava and Mayank Singh. 2021. Hinge: A dataset for generation and evaluation of code-mixed hinglish text. In Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems, pages 200–208.
Winata et al. (2023) Genta Winata, Alham Fikri Aji, Zheng Xin Yong, and Thamar Solorio. 2023. The decades progress on code-switching research in nlp: A systematic survey on trends and challenges. In Findings of the Association for Computational Linguistics: ACL 2023, pages 2936–2978.
Winata et al. (2024a) Genta Indra Winata, David Anugraha, Lucky Susanto, Garry Kuwanto, and Derry Tanti Wijaya. 2024a. Metametrics: Calibrating metrics for generation tasks using human preferences. arXiv preprint arXiv:2410.02381.
Winata et al. (2021) Genta Indra Winata, Samuel Cahyawijaya, Zihan Liu, Zhaojiang Lin, Andrea Madotto, and Pascale Fung. 2021. Are multilingual models effective in code-switching? In Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching, pages 142–153.
Winata et al. (2024b) Genta Indra Winata, Frederikus Hudi, Patrick Amadeus Irawan, David Anugraha, Rifki Afina Putri, Yutong Wang, Adam Nohejl, Ubaidillah Ariq Prathama, Nedjma Ousidhoum, Afifa Amriani, et al. 2024b. Worldcuisines: A massive-scale benchmark for multilingual and multicultural visual question answering on global cuisines. arXiv preprint arXiv:2410.12705.
Winata et al. (2019) Genta Indra Winata, Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. 2019. Code-switched language models using neural based synthetic data from parallel sentences. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 271–280.
Winata et al. (2024c) Genta Indra Winata, Hanyang Zhao, Anirban Das, Wenpin Tang, David D Yao, Shi-Xiong Zhang, and Sambit Sahu. 2024c. Preference tuning with human feedback on language, speech, and vision tasks: A survey. arXiv preprint arXiv:2409.11564.
Yong et al. (2023) Zheng Xin Yong, Ruochen Zhang, Jessica Forde, Skyler Wang, Arjun Subramonian, Holy Lovenia, Samuel Cahyawijaya, Genta Winata, Lintang Sutawika, Jan Christian Blaise Cruz, et al. 2023. Prompting multilingual large language models to generate code-mixed texts: The case of south east asian languages. In Proceedings of the 6th Workshop on Computational Approaches to Linguistic Code-Switching, pages 43–63.
Zhang et al. (2023) Ruochen Zhang, Samuel Cahyawijaya, Jan Christian Blaise Cruz, Genta Winata, and Alham Aji. 2023. Multilingual large language models are not (yet) code-switchers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12567–12582.

Appendix A Switching Point Algorithm

The algorithm for the process of getting valid switching points is as described in Algorithm 1

Result: List of valid switching points

GetValidSwitchingPoints(pairs)

valid_pairs

\leftarrow

[ ];

for $i\leftarrow 1$ to $\text{length}(pairs)$ do

valid

\leftarrow

true;

for $j\leftarrow 1$ to $\text{length}(pairs)$ do

(a_{i},b_{i})\leftarrow pairs[i]

;

(a_{j},b_{j})\leftarrow pairs[j]

;

if $(a_{i}<a_{j}\text{ and }b_{i}>b_{j})$ or $(a_{i}>a_{j}\text{ and }b_{i}<b_{j})$ then

valid

\leftarrow

false;

break;

end if

end for

if valid then

Append

pairs[i]

to valid_pairs;

end if

end for

return valid_pairs;

Algorithm 1 Identification of Valid Switching Points

Appendix B Statistical Analysis of Human Evaluation

B.1 Overall Effects

We conducted a series of one-way ANOVAs to investigate the impact of three primary factors—Model, Method, and Direction—on both Accuracy and Fluency scores. The results, presented in Table 6, indicate that all factors have a statistically significant effect on both measures (all p ¡ .001).

Specifically, the Model factor shows a strong effect on Accuracy ( $F=43.71,p<.001$ ) and a moderate effect on Fluency ( $F=18.82,p<.001$ ), suggesting that different models handle the nuances of code-switched text differently. The Method factor also significantly influences both Accuracy ( $F=9.66,p<.001$ ) and Fluency ( $F=20.80,p<.001$ ), highlighting the importance of incorporating linguistic constraints during generation. Finally, the Direction factor has the highest impact on both Accuracy ( $F=323.13,p<.001$ ) and Fluency ( $F=293.31,p<.001$ ), indicating a substantial difference in translation quality between English-Indic and Indic-English directions. These findings underscore the necessity of considering all three factors when evaluating code-switching generation methods.

Factor	Accuracy		Fluency
	F-score	p-value	F-score	p-value
Model	43.71	¡.001	18.82	¡.001
Method	9.66	¡.001	20.80	¡.001
Direction	323.13	¡.001	293.31	¡.001

Table 6: Results of one-way ANOVAs for examining the effects of Model, Method, and Direction on Accuracy and Fluency scores. Each factor shows a significant impact on both measures (all p ¡ .001), indicating that the choice of model, the generation method, and the translation direction significantly influence the overall quality of code-switched text generation.

B.2 Model Comparison

For the model comparisons, we evaluate the performance of Llama3, Llama3.1, and Aya23 on English-Indic and Indic-English translation tasks, focusing on both Accuracy and Fluency. The detailed results for English-Indic translation are presented in Table 7. For accuracy, Llama3.1 significantly outperforms both Aya23 and Llama3, while no significant difference is observed between Aya23 and Llama3. In terms of fluency, Llama3 outperforms Aya23, and Llama3.1 outperforms both Llama3 and Aya23, though the difference between Aya23 and Llama3.1 is not statistically significant.

For the Indic-English translation, the comparisons show that both Llama3 and Llama3.1 significantly outperform Aya23 in terms of accuracy, but there is no significant difference between Llama3 and Llama3.1. For fluency, both Llama3 and Llama3.1 again outperform Aya23, with a slight significant difference between Llama3.1 and Llama3, as shown in Table 7.

English to Indic Translation
Model Comparison	Measure	Mean Diff.	p-value
Llama3.1 vs. Aya23	Accuracy	0.1881	¡ .001
Llama3.1 vs. Llama3	Accuracy	0.1859	.0001
Aya23 vs. Llama3	Accuracy	0.0022	.9986
Llama3 vs. Aya23	Fluency	0.1267	.0034
Llama3.1 vs. Llama3	Fluency	0.1859	¡ .001
Aya23 vs. Llama3.1	Fluency	0.0593	.2813
Indic to English Translation
Llama3 vs. Aya23	Accuracy	0.3363	¡ .001
Llama3.1 vs. Aya23	Accuracy	0.4326	¡ .001
Llama3.1 vs. Llama3	Accuracy	0.0963	.0807
Llama3 vs. Aya23	Fluency	0.1889	¡ .001
Llama3.1 vs. Aya23	Fluency	0.2881	¡ .001
Llama3.1 vs. Llama3	Fluency	0.0993	.0452

Table 7: Tukey’s post-hoc test results for model comparisons across English-Indic and Indic-English translation directions. The table reports the mean differences and p-values for Accuracy and Fluency between Aya23, Llama3, and Llama3.1 models, highlighting significant variations in performance depending on the model used.

B.3 Method Comparisons

The method comparisons analyze Human ECT, baseline approaches, and EZSwitch for English-Indic and Indic-English translation. Table 8 presents the results. For English to Indic translation, Human ECT significantly outperforms Baseline in terms of accuracy, but the difference between Baseline and EZSwitch, as well as between Human ECT and EZSwitch, is not significant. In terms of fluency, both Human ECT and EZSwitch outperform the Baseline significantly, with no significant difference observed between Human and EZSwitch.

For Indic to English translation, Human ECT and EZSwitch both outperform the Baseline for accuracy, and there is no significant difference between Human ECT and EZSwitch. Similarly, for fluency, both methods show significant improvements over Baseline, but no substantial difference between Human and EZSwitch is observed, as shown in Table 8

English to Indic Translation
Method Comparison	Measure	Mean Diff.	p-value
Human ECT vs. Baseline	Accuracy	0.1348	.0058
Baseline vs. EZSwitch	Accuracy	0.0993	.0597
Human ECT vs. EZSwitch	Accuracy	0.0356	.6939
Human ECT vs. Baseline	Fluency	0.1578	.0002
EZSwitch vs. Baseline	Fluency	0.1319	.0022
Human ECT vs. EZSwitch	Fluency	0.0259	.7843
Indic to English Translation
Human ECT vs. Baseline	Accuracy	0.1281	.0157
EZSwitch vs. Baseline	Accuracy	0.1496	.0036
Human ECT vs. EZSwitch	Accuracy	0.0215	.8881
Human ECT vs. Baseline	Fluency	0.2022	¡ .001
EZSwitch vs. Baseline	Fluency	0.1763	.0001
Human ECT vs. EZSwitch	Fluency	0.0259	.8101

Table 8: Tukey’s post-hoc test results for method comparisons across English-Indic and Indic-English translation directions. The table presents the mean differences and p-values for Human ECT, EZSwitch, and Baseline methods, indicating significant variations in Accuracy and Fluency depending on the generation approach used.

B.4 Direction Comparison

We also compare the direction of translation (English-Indic vs. Indic-English) to determine if there are asymmetries in model performance. The results in Table 9 indicate that Indic to English translation significantly outperforms English to Indic translation for both Accuracy and Fluency, with mean differences of 0.4684 and 0.4037, respectively (p ¡ .001 for both measures).

Measure	Mean Difference	p-value
Accuracy	0.4684	¡ .001
Fluency	0.4037	¡ .001

Table 9: Tukey’s post-hoc test results for mean differences and p-values in Accuracy and Fluency between Indic-to-English and English-to-Indic translation directions. Significant differences indicate that the direction of translation has a substantial impact on performance, with Indic-to-English outperforming English-to-Indic in both metrics.

Appendix C Prompt Details

In table 10 we present the specific prompts used across different methods in our code-switching experiments. The Translate prompt is used when we generate the translations to get alignment. The Baseline prompt is used when we evaluate LLM to generate codeswitching. The ECT prompt is used for both EZSwitch and Human ECT. The GPT Eval prompt defines the structure for evaluating code-switching output based on accuracy and fluency.

Method

Prompt

Translate

Translate the following lang1 sentence to lang2:

Baseline

You are a Bilingual lang1-lang2 speaker, you will help translate these lang1

sentences into a code-mixed sentence with Romanized lang2 and lang1

EZSwitch

You are a Bilingual lang1-lang2 speaker, you will help translate these lang1

sentences into a code-mixed sentence with Romanized lang2 and lang1

with specific keywords that should to appear.

Words wanted: <List of Words>

GPT Eval

You are provided with triplets of sentences. The first two sentence in each triplet

is the original monolingual sentences. The second sentence is a generated

code-switched sentence. Your task is to evaluate the generated sentence based

on two criteria: Accuracy and Fluency. You will score each criterion on a scale

from 1 to 3, where 1 is the lowest and 3 is the highest. When evaluating the

generated sentences, focus on the content and meaning. Ignore any extra

formatting, alignment artifacts, or additional explanatory text. Judge the

sentence to determine its accuracy and fluency.

original_l1: <Original Lang1 Sentence>

original_l2: <Original Lang2 Sentence>

generated: <Code Switched Sentence>

Table 10: Prompts used in our experiment.

Appendix D Human Evaluation

D.1 Annotation Guidelines

The following guidelines are provided to human evaluators to assess the model’s responses. Evaluators rate the generated sentences based on two criteria: Accuracy and Fluency. The original sentence is in English, Indian local languages (Hindi, Malayalam, and Tamil), and evaluators must adhere to the rubrics outlined below.

D.1.1 General Guidelines

•

MUST: Be objective while rating the responses.
•

MUST: Strictly follow the rubrics for Accuracy and Fluency evaluation.
•

Score each criterion on a scale from 1 to 3, where 1 is the lowest and 3 is the highest.
•

Ignore formatting, and any additional explanatory text generated by the language model. Only focus on meaning and context.
•

If the model fails to generate a response, assign a score of 1 for both Accuracy and Fluency.

D.1.2 Accuracy

Accuracy measures how well the generated sentence preserves the meaning and information of the original sentence and whether the code-switched terms are used correctly. The scores are as follows:

•
1. Low Accuracy:
- –
  
  Significant deviations from the original meaning.
- –
  
  Key information is missing, altered, or repeated redundantly.
- –
  
  Code-switched terms are incorrect or inappropriate.
- –
  
  Introduces new information not present in the original sentence.
- –
  
  Key details are altered or repeated redundantly.
•
2. Moderate Accuracy:
- –
  
  Minor deviations from the original meaning.
- –
  
  Most key information is present but may have slight errors.
- –
  
  Most code-switched terms are appropriate but with minor mistakes.
•
3. High Accuracy:
- –
  
  Preserves the original meaning fully.
- –
  
  All key information is present and correct.
- –
  
  Code-switched terms are accurate and appropriately used.

D.1.3 Fluency

Fluency measures how natural and easy to understand the generated sentence is, considering grammar, syntax, and the smooth integration of code-switching. The scores are as follows:

•
1. Low Fluency:
- –
  
  The sentence is difficult to understand or awkward.
- –
  
  Poor grammar or syntax in either language.
- –
  
  Code-switching disrupts the flow of the sentence.
•
2. Moderate Fluency:
- –
  
  The sentence is understandable but may have awkward or unnatural phrasing.
- –
  
  Acceptable grammar and syntax in both languages.
- –
  
  Code-switching is somewhat smooth but not perfectly integrated.
•
3. High Fluency:
- –
  
  The sentence is natural and easy to understand.
- –
  
  Good grammar and syntax in both languages.
- –
  
  Code-switching is smooth and seamless, enhancing the sentence flow.

D.2 Inter Annotator Agreement

As seen in Table 11, the inter-annotator agreement, measured by Krippendorff’s alpha, reveals varying levels of consensus across languages, with the highest agreement for Hindi. While Fluency is generally lower, this is expected as Fluency is more of a subjective measure.

Throughout the evaluation process, we continuously monitored the quality of annotations by measuring inter-annotator agreement at regular intervals. If the agreement metric indicated significant divergence in scores, particularly when individual annotators’ ratings deviated notably from the group consensus, we conducted alignment meetings. These meetings were used to clarify the guidelines and ensure a consistent understanding of the evaluation criteria among the annotators. During these sessions, any inconsistencies were discussed and resolved to improve consistency, especially in subjective aspects like Fluency. This iterative process helped ensure the reliability of the final evaluations and minimized discrepancies in the ratings.

Language	Fluency	Accuracy
Tamil	0.321	0.445
Malayalam	0.405	0.423
Hindi	0.646	0.720

Table 11: Krippendorff’s alpha scores for inter-annotator agreement on Fluency and Accuracy across Tamil, Malayalam, and Hindi.