
Building Dialogue Understanding Models for Low-resource Language Indonesian from Scratch

Donglin Di [email protected] Advance.AI Weinan Zhang [email protected] Harbin Institute of Technology Yue Zhang [email protected] Westlake University  and  Fanglin Wang [email protected] Advance.AI
(2023)
Abstract.

Making use of off-the-shelf resources from resource-rich languages to transfer knowledge to low-resource languages has attracted much attention recently. However, the requirements for enabling a model to reach reliable performance, such as the scale of annotated data required or an effective framework, remain poorly characterized. To investigate the first question, we empirically study the cost-effectiveness of several methods for training intent classification and slot-filling models for Indonesian (ID) from scratch by utilizing English data. Confronting the second challenge, we propose a Bi-Confidence-Frequency Cross-Lingual transfer framework (BiCF), composed of “BiCF Mixing”, “Latent Space Refinement”, and “Joint Decoder”, to tackle the obstacle of lacking low-resource language dialogue data. Extensive experiments demonstrate that our framework performs reliably and cost-efficiently on different scales of manually annotated Indonesian data. We release a large-scale fine-labeled dialogue dataset (ID-WOZ) and ID-BERT of Indonesian for further research.

dialogue datasets, intent classification, slot-filling, Indonesian
copyright: acmcopyrightjournalyear: 2023doi: 10.1145/3575803ccs: Computing methodologies Neural networksccs: Computing methodologies Discourse, dialogue and pragmatics

1. Introduction

It is generally accepted that neural dialogue understanding models rely heavily on large-scale training data (Liu et al., 2018). Existing work has been conducted mostly on rich-resource languages such as English (Wen et al., 2016; Lowe et al., 2015) and Chinese (Wu et al., 2016), but thousands of low-resource languages around the world lack training data. It is impractical and cost-ineffective to collect and annotate large-scale datasets for low-resource languages (Grave et al., 2018) to train dialogue understanding models. Therefore, as shown in Fig. 1, it remains a huge challenge to efficiently adapt existing research resources and findings to low-resource languages (e.g., Indonesian (ID)), so that the need for understanding multilingual task-oriented dialogue (Schuster et al., 2018; Schuster et al., 2019) can be addressed effectively.

Refer to caption
Figure 1. The investigation of utilizing off-the-shelf resources and models for generating low-resource language dialogue understanding models from scratch.

One challenge is how to make effective use of off-the-shelf resources of rich-resource languages (e.g., English (Budzianowski et al., 2018)). An intuitive method is to use a neural machine translation system (Vaswani et al., 2018; Cheng, 2019) to translate the English dataset into Indonesian, and then train the dialogue understanding models on the translated data. Another strategy is to utilize multilingual word embeddings (Devlin et al., 2018; Pires et al., 2019), which allow a dialogue model trained on the English dataset to be applied directly to Indonesian, since the pre-trained multilingual model contains vocabulary for both languages. Each of these methods has its own strengths and limitations. The former allows us to use a language-specific pre-trained model (Peters et al., 2018; Devlin et al., 2018; Yang et al., 2019) for better representation, obtained by training on large-scale unlabelled general documents or web pages. However, methods of this branch suffer from machine translation errors and from dislocated, invalid annotations carried over from the source corpus, which can significantly hurt the subsequent dialogue modeling (i.e., slot-filling). The latter suffers from intrinsic differences between English and Indonesian, including variations in syntactic and semantic patterns.

The second challenge lies in how to transfer existing models to the target low-resource language. Cross-lingual transfer learning (Schuster et al., 2018) can potentially alleviate the limitations of both methods. One possible approach is to align the contextual word embeddings or sentence-level encodings in the semantic latent space (Schuster et al., 2019), which avoids semantic misunderstandings and syntactic mistakes. However, this method can be impacted by imperfect alignments, and its implementation is complex, which slows application and deployment.

In this work, we propose a Bi-confidence-frequency Cross-lingual Transfer framework (BiCF) to overcome the challenges illustrated above. For the first challenge, we adopt the word-level alignment strategy (Zhang et al., 2019), which has been demonstrated to be as effective as phrase-level alignment yet much simpler and more stable (Tiedemann, 2015; Tiedemann and Agić, 2016). Specifically, the first stage of BiCF is Bi-confidence-frequency Mixing, which utilizes the English dataset to generate code-mixed data, avoiding sentence-level translation errors as well as label dislocation. The mixed data not only takes importance frequency and translation confidence into consideration but also carries gold annotations for Indonesian from the English datasets. For the second challenge, Latent Space Refinement and Joint Decoder are designed on top of the resulting high-quality mixed data, utilizing and refining pre-trained off-the-shelf word embedding models, to eventually train the dialogue understanding models (i.e., intent classification and slot-filling) for Indonesian.

To conduct extensive experiments for Indonesian, we follow the methodology of MultiWOZ (Budzianowski et al., 2018), a large-scale task-oriented English dialogue dataset, to collect and annotate a counterpart dataset with richer domains in Indonesian (ID), named ID-WOZ. Extensive experiments show that the proposed framework achieves comparable or better performance with less gold annotated data. We quantify the influence of each factor in our experiments.

The main contributions of this paper are summarized as follows:

  • We propose a framework (BiCF) that utilizes an English dataset to train Indonesian dialogue understanding models and achieves good performance on the Indonesian dialogue dataset.

  • We release a large-scale manually annotated multi-domain ID-WOZ dialogue dataset, together with a pre-trained ID-BERT model as the resource contributions for low-resource language dialogue understanding tasks.

  • We investigate how much annotated data well-performing dialogue understanding models demand, which may guide related research on collecting datasets or training models for other low-resource languages.

2. Related Work

Low-resource Language. Existing works (Guo et al., 2015; Duong et al., 2015; Ammar et al., 2016; Wang et al., 2017) study multilingual parsing on low-resource languages, which is helpful for improving low-resource language understanding. Wang et al. (2017) propose to integrate English syntactic knowledge into a parser trained on the Singlish treebank, showing that it is reasonable to leverage English to improve low-resource language models.
Cross-lingual Transfer. Artetxe et al. (2017) propose a self-learning framework and a small word dictionary to learn a mapping between source and target word embeddings. Zhang et al. (2019) focus on improving dependency parsing by mixing confident target words into the source treebank. Schuster et al. (2019) utilize Multilingual CoVe embeddings obtained from machine translation systems (McCann et al., 2017) in Thai and Spanish for zero-shot dependency parsing. Relying on aligned parallel sentence pairs suffers from noise and imperfect alignments (Zhang et al., 2019). In line with these methods, encoding the semantic information directly within the same cross-lingual latent space can avoid the semantic misunderstandings that arise from machine translation or wrong alignments.

3. ID-WOZ

There has been a lack of available datasets for training natural dialogue understanding systems in regional low-resource languages such as Indonesian (Chowanda and Chowanda, 2017; Koto, 2016; Tho et al., 2018). As a result, we build our ID-WOZ dataset from scratch; detailed statistics are reported in the Experiments section. In particular, we organize a structured annotation scheme with structured semantic labels, drawing on the experience of previous work (Williams et al., 2013; Asri et al., 2017; Eric and Manning, 2017; Shah et al., 2018; Budzianowski et al., 2018).

3.1. Coverage

ID-WOZ is constructed with the goal of obtaining highly natural conversations between a customer and an agent or a query information center, focusing on daily life. We consider various possible dialogue scenarios, ranging from basic requests such as hotel and restaurant to a few urgent situations such as hospital or police. Our dataset consists of nine domains, namely plane, taxi, wear, restaurant, movie, hotel, attraction, hospital, and police, most of which are extended domains including the sub-task Booking (with the exception of police).

3.2. Collection and Annotation

We adopt the Wizard-of-Oz (Kelley, 1984) dialogue-collection approach, which has been shown to be effective for obtaining a high-quality corpus at relatively low cost and with modest time effort. Following the successful experience of MultiWOZ (Budzianowski et al., 2018), we create a large-scale corpus of natural human-human conversations on a similar scale. Based on the given templates for several domains, the users and wizards generate conversations using heuristic-based rules to prevent information overflow. We design and develop a collection-annotation pipeline platform with a user-friendly interface for building up the dataset, shown in the Experiments section.

To accelerate and optimize the collection and annotation process, we design and develop a pipeline platform. It consists of three stages, “collection - annotation - statistics & analysis”, which are executed synchronously after the initialization process. We split a number of well-trained annotators (80 local people: 70 native ID speakers and 10 bilingual citizens, plus 2 main organizers) into two groups to produce dialogues and annotations. A quarter of the annotators (20) are trained to play the wizard role following the templates we provide. After about 1K dialogues have been collected initially (about one week), and while conversation collection is still ongoing, the second group of annotators (62) joins in to produce the detailed fully labeled corpus, including domains, actions, intents, and slots.

Quality is assured through three processes, namely “script checking”, “cross-checking”, and “supervisor checking”. In particular, the scripts filter out cases with potential faults such as missing or malformed labels. In the cross-checking process, the annotators are assigned not only fresh unlabeled annotation tasks but also a sample of labeled cases (20%) from their peers, in a double-blind process. Cases passing the cross-checking procedure are sampled and handed over to the supervisors (the two organizers), who are more familiar with the details of the entire annotation task and control the overall consistency and accuracy of the annotation. We further adopt the inter-annotator agreement (IAA) (Fleiss et al., 1969) to measure how well our recruited annotators make the same annotation decision for a given category, as follows:

(1) $\kappa \equiv \frac{p_{o}-p_{e}}{1-p_{e}} = 1-\frac{1-p_{o}}{1-p_{e}}$

where $p_{o}$ and $p_{e}$ denote the relative observed agreement among raters and the hypothetical probability of chance agreement, respectively. The average score on our dataset is 0.834.
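For illustration, the following is a minimal Python sketch of the multi-rater (Fleiss) form of the agreement score in Eq. 1; the per-item category counts in the example are invented.

import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """counts[i, j] = number of raters assigning item i to category j."""
    n_raters = counts.sum(axis=1)[0]          # raters per item (assumed constant)
    p_j = counts.sum(axis=0) / counts.sum()   # marginal category proportions
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_o, p_e = p_i.mean(), (p_j ** 2).sum()   # observed / chance agreement
    return (p_o - p_e) / (1 - p_e)            # Eq. 1

# Toy example: three annotators labeling four utterances with one of three intents.
counts = np.array([[3, 0, 0], [2, 1, 0], [0, 3, 0], [1, 1, 1]])
print(round(fleiss_kappa(counts), 3))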

3.3. Statistics and Analysis

Table 1 compares our dataset with existing Indonesian datasets and the English dialogue dataset MultiWOZ (Budzianowski et al., 2018). ID Chat (Koto, 2016) is the first publicly available Indonesian chat corpus and has drawn some related research on Indonesian-language dialogue (Chowanda and Chowanda, 2017). Dyadic Chat (Tho et al., 2018) is another public Indonesian chat corpus, focusing on dyadic conversation; dyadic describes the relationship between two people, for instance the romantic relationship of a couple. Compared with these small-scale datasets, ID-WOZ is the first large-scale corpus (about ten thousand dialogues across multiple domains) focusing on general task-oriented chat.

MultiWOZ (Budzianowski et al., 2018) is a large-scale multi-domain task-oriented English dialogue dataset covering seven distinct domains (taxi, restaurant, hotel, attraction, hospital, police, train), with fine-labeled actions and slots for the spoken language understanding stage. Considering the regional cultural background, our collected dataset contains a few more general domains (i.e., wear, movie, plane) and more corresponding slot types, such as clothes type, movie genre, and movie synopsis.

Dataset        | ID Chat | Dyadic Chat | MultiWOZ  | ID-WOZ
Domains        | None    | None        | 7         | 9
Language       | ID      | ID          | En        | ID
Total # dials  | 300     | 79          | 8,438     | 9,189
Total # tokens | 150,000 | 3,164       | 1,520,970 | 1,551,591
Total # utters | 1,000   | 158         | 142,974   | 251,184
Avg. # turns   | 3       | 3           | 13.68     | 13.67
Avg. # slots   | -       | -           | 25        | 8.8
Table 1. Comparison of ID-WOZ with other related datasets on several statistical metrics.

4. ID-BERT

Although there are several off-the-shelf pre-trained BERT models for rich-resource languages such as English and Chinese, pre-trained language-specific models for low-resource languages like Indonesian are, to our knowledge, still not available. We release a pre-trained model for Indonesian named ID-BERT as another resource contribution. Although most related work (Schuster et al., 2019; Pires et al., 2019) relies on the pre-trained Multilingual-BERT model and fine-tunes it for low-resource languages, our main goal is to build a relatively reliable dialogue system and to examine how to overcome the gap between a language-specific BERT and the multilingual BERT. We therefore train an ID-BERT to compare performance on spoken language understanding and further tasks. We pre-train a BERT for Indonesian from scratch on about 3.3 billion tokens of document-level corpus from Indonesian websites, covering news reports, research papers, daily articles, and other text genres. The vocabulary of our ID-BERT contains 0.9M entries, much larger than that of Multilingual-BERT (0.12M); we believe this vocabulary size is sufficient to cover most scenarios of daily multi-domain task-oriented dialogue in Indonesian. Training takes one week on a Google Cloud TPU v3-8, and our ID-BERT (Cased, L=12, H=768, A=12) is eventually obtained.

5. BiCF Cross-lingual Transfer

Refer to caption
Figure 2. Illustration of the proposed framework (BiCF), which consists of BiCF Mixing, Latent Space Refinement, and Joint Decoder. The frequency-word and confidence-word sets in the first stage are derived from the English dataset and from confidence-translated parallel sentences, respectively; the mixed data is generated by fusion and mixing. The cross-lingual space refinement module generates a target-specific embedding model to better represent Indonesian. The final stage decodes and outputs intents and slots jointly.

Here we describe our proposed pipeline framework, “BiCF Cross-lingual Transfer” (BiCF), in detail. It consists of three components, namely “BiCF Mixing”, “Latent Space Refinement”, and “Joint Decoder”. As shown in Fig. 2, the BiCF Mixing step replaces a few English words with Indonesian ones. We then train and refine the cross-lingual semantic embedding latent space on the mixed data with gold annotations from the English dataset. Finally, we adopt a combination of BiLSTM and CRF to decode the intent and slots jointly.

5.1. BiCF Mixing

The first stage of our framework is “Bi-confidence-frequency Mixing” (BiCF Mixing). As shown in Fig. 2, we use the source-language data in two steps. The first is to generate the frequency-word set ($\mathbf{W}_{freq}$) of the source data. The second is to obtain the word alignment with the translating confidence ($\lambda_{conf}$) of each word and generate the confidence-word set ($\mathbf{W}_{conf}$). The goal of this stage is to select word pairs for English and Indonesian that are both frequent and high-confidence, and to yield the mixed data $\mathcal{T}_{mix}$.

Given the set of source sentences $\mathcal{S}=\{s_{1},s_{2},...,s_{n}\}$, we calculate TF-IDF (Salton et al., 1982; Ramos et al., 2003) for each word in the source dialogue corpus, as shown in Eq. 2:

(2) $\left\{\begin{matrix}tf_{(i,j)}=\frac{\mathcal{N}(w_{i}^{s},s_{j})}{\sum_{k}\mathcal{N}(w_{k}^{s},s_{j})}\\ idf_{(i)}=\log\frac{\left|\mathcal{S}\right|}{1+\left|\{j:w_{i}^{s}\in s_{j}\}\right|}\\ tf\text{-}idf_{(i,j)}=tf_{(i,j)}\times idf_{(i)}\end{matrix}\right.$

where $\mathcal{N}(w_{i}^{s},s_{j})$ is the number of occurrences of the word $w_{i}^{s}$ in the source-language sentence $s_{j}$, and the denominator $\sum_{k}\mathcal{N}(w_{k}^{s},s_{j})$ is the sum of occurrences of all terms $w_{k}^{s}$ in the sentence $s_{j}\in\mathcal{S}$. $\left|\mathcal{S}\right|$ represents the number of sentences and $\left|\{j:w_{i}^{s}\in s_{j}\}\right|$ denotes the number of sentences containing the word $w_{i}^{s}$. The frequency-word set $\mathbf{W}_{freq}=\langle(w_{i}^{s},r_{i}),...,(w_{j}^{s},r_{j})\rangle$ is obtained by sorting the outputs $tf\text{-}idf_{(i,j)}$ of the TF-IDF algorithm, where $r_{i}$ denotes the frequency score.
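As an illustration of Eq. 2, a minimal Python sketch that scores source words and builds $\mathbf{W}_{freq}$ might look as follows; the toy sentences and the max-over-sentences aggregation of per-sentence scores are our assumptions, since the paper does not fix the aggregation.

import math
from collections import Counter

sentences = [["i", "need", "a", "cheap", "hotel"],
             ["book", "a", "table", "at", "a", "cheap", "restaurant"]]

doc_freq = Counter(w for s in sentences for w in set(s))  # |{j : w in s_j}|
n_sents = len(sentences)                                  # |S|

scores = {}
for s in sentences:
    tf = Counter(s)
    for w, c in tf.items():
        # tf-idf_(i,j) per Eq. 2, with the smoothed idf denominator.
        tfidf = (c / len(s)) * math.log(n_sents / (1 + doc_freq[w]))
        scores[w] = max(scores.get(w, float("-inf")), tfidf)

# W_freq: words sorted by their best score r_i, highest first.
W_freq = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)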

Algorithm 1 BiCF Mixing
Input: $\mathcal{S}$, $\mathbf{W}_{freq}$, $\lambda_{freq}$, $\mathbf{W}_{conf}$, $\lambda_{conf}$, $\theta$
Output: $\mathcal{T}_{mix}$
1: $\widehat{\mathbf{W}}_{freq} \leftarrow Thresh(\mathbf{W}_{freq},\lambda_{freq})$
2: $\widehat{\mathbf{W}}_{conf} \leftarrow Thresh(\mathbf{W}_{conf},\lambda_{conf})$
3: $\widetilde{\mathbf{W}}_{sub} \leftarrow Fusion(\widehat{\mathbf{W}}_{freq},\widehat{\mathbf{W}}_{conf},\theta)$
4: $\mathcal{T}_{mix} \leftarrow \Phi$
5: for $s\in\mathcal{S}$ do
6:   $\widehat{s} \leftarrow s$
7:   for $w^{s}\in s$ do
8:     if $w^{s}\in\widetilde{\mathbf{W}}_{sub}$ then
9:       $w^{t} \leftarrow Get(\widetilde{\mathbf{W}}_{sub},w^{s})$
10:      $\widehat{s} \leftarrow Mixing(\widehat{s},w^{s},w^{t})$
11:    end if
12:  end for
13:  $\mathcal{T}_{mix} \leftarrow \mathcal{T}_{mix}\cup\widehat{s}$
14: end for
15: return $\mathcal{T}_{mix}$

Then we adopt small-scale, high-quality parallel sentences (1K), translated by skilled bilingual translators, to generate word alignments using fast_align (Dyer et al., 2013). Given the English sentences and their corresponding confidently translated sentences, the fast_align model uses a log-linear reparameterization of IBM Model 2 (Collins, 2011) to generate a set of confidence-word pairs $\mathbf{W}_{conf}=\langle(w_{i}^{s},w_{i}^{t}),p_{i}\rangle,...,\langle(w_{j}^{s},w_{j}^{t}),p_{j}\rangle$, with the Indonesian word and the confidence score denoted by $w_{i}^{t}$ and $p_{i}$, respectively.
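Since fast_align reads "english ||| indonesian" bitext lines and emits Pharaoh-style links ("0-0 1-2 ...") rather than explicit per-link scores in its default output, one hedged way to construct $\mathbf{W}_{conf}$ is to estimate $p_{i}$ as the relative alignment frequency of each word pair over the 1K parallel sentences, as sketched below; the frequency-based confidence is our assumption, not necessarily the paper's exact scoring.

from collections import Counter

def build_w_conf(parallel_path: str, align_path: str):
    """Build {(english_word, indonesian_word): confidence} from fast_align output."""
    pair_counts, src_counts = Counter(), Counter()
    with open(parallel_path) as fp, open(align_path) as fa:
        for bitext, links in zip(fp, fa):
            src, tgt = [side.split() for side in bitext.split("|||")]
            for link in links.split():                  # e.g. "0-0", "1-2"
                i, j = map(int, link.split("-"))
                pair_counts[(src[i], tgt[j])] += 1
                src_counts[src[i]] += 1
    # Confidence p_i: how often this translation is chosen for the source word.
    return {pair: n / src_counts[pair[0]] for pair, n in pair_counts.items()}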

As shown in Algorithm 1, after selecting the words above both the frequency threshold $\lambda_{freq}$ and the confidence threshold $\lambda_{conf}$, we fuse the words to generate the substitute word set $\mathbf{W}_{sub}$. The $Thresh$ function in lines 1 and 2 of Algorithm 1 is defined as Eq. 3:

(3) $\widehat{\mathbf{W}}=Sort(\mathbf{W}_{(\cdot)},\mathcal{P}_{(\cdot)})\odot\lambda_{(\cdot)}$

where $\mathbf{W}_{(\cdot)}$ denotes the frequency-word set ($\mathbf{W}_{freq}$) or the confidence-word set ($\mathbf{W}_{conf}$), $\mathcal{P}_{(\cdot)}$ denotes the frequency scores $r_{i}$ or the confidence scores $p_{i}$, and $\odot$ is the top-subset selection operation. The $Fusion$ function in line 3 is implemented as Eq. 4:

(4) $\widetilde{\mathbf{W}}=(\widehat{\mathbf{W}}_{freq}\odot\theta)\cap(\widehat{\mathbf{W}}_{conf}\odot(1-\theta))$

where $\theta$ is a hyper-parameter adjusting the ratio between the two branches of word sets. Lines 4 to 14 in Algorithm 1 illustrate the mixing procedure: we incrementally substitute the source word $w^{s}$ of a temporarily copied sentence $\widehat{s}$ with the corresponding target word $w^{t}$. In this way, the mixed corpus $\mathcal{T}_{mix}$, consisting of both English and Indonesian words, is generated.
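For concreteness, a compact Python sketch of Algorithm 1 under the definitions above is given below; the threshold values, $\theta$, and the data structures (word-to-score dictionaries) are illustrative assumptions.

def thresh(scored_words, lam):
    """Eq. 3: keep entries whose score passes the threshold, best first."""
    ordered = sorted(scored_words.items(), key=lambda kv: kv[1], reverse=True)
    return {w: s for w, s in ordered if s >= lam}

def fusion(w_freq, w_conf, theta):
    """Eq. 4: intersect the top theta of frequent source words with the
    top (1 - theta) of confident (source, target) translation pairs."""
    top_freq = set(list(w_freq)[:int(theta * len(w_freq))])
    sub = {}
    for (src, tgt), _ in list(w_conf.items())[:int((1 - theta) * len(w_conf))]:
        if src in top_freq:
            sub[src] = tgt          # substitute English src with Indonesian tgt
    return sub

def bicf_mixing(sentences, w_freq, lam_f, w_conf, lam_c, theta=0.5):
    """Lines 4-14 of Algorithm 1: substitute selected words sentence by sentence."""
    w_sub = fusion(thresh(w_freq, lam_f), thresh(w_conf, lam_c), theta)
    return [[w_sub.get(w, w) for w in s] for s in sentences]

Here w_freq maps each English word to its TF-IDF score $r_{i}$ and w_conf maps each (English, Indonesian) pair to its confidence $p_{i}$, as built in the two preceding steps.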

5.2. Cross-lingual Space Refinement

We train and refine the initially pre-trained multilingual model (i.e., Multilingual-BERT) on the mixed corpus $\mathcal{T}_{mix}$ with annotations from the source English dataset. This operation updates the embeddings of both the English and the Indonesian words. This stage therefore allows our model to make use of English corpora and obtain a refined latent space with improved semantic representations. The multilingual latent space is updated with the discriminative training process of Eq. 5:

(5) $\left\{\begin{matrix}\Theta_{i+1}^{l}=\Theta_{i}^{l}-\eta^{l}\cdot\nabla_{\Theta^{l}}J(\Theta)\\ \eta^{l-1}=\xi\cdot\eta^{l}\end{matrix}\right.$

where $\eta^{l}$ denotes the learning rate of the $l$-th layer, $\Theta_{i}^{l}$ represents the parameters of the model at the $l$-th layer in step $i$, and $\nabla_{\Theta^{l}}J(\Theta)$ is the gradient of the parameters at the $l$-th layer with respect to the model's objective function, i.e., supervised by the intent classification and slot-filling annotations in our model.

When performance stabilizes on the training set (around 25 epochs in our experiments), we save the model that performs best on the validation set as the mixed refined embedding model, denoted by the blue embedding space in the middle of Fig. 2. We then feed fine-labeled Indonesian data into the mixed refined embedding model and transfer once more to obtain a refined target-specific embedding model. In this way, by utilizing the English dataset, we generate a better latent representation space for Indonesian, encoding each sentence into a feature vector in $\mathbb{R}^{1\times 768}$.
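As a minimal sketch of the layer-wise update in Eq. 5, the PyTorch snippet below assigns each encoder layer its own learning rate, decayed by $\xi$ toward the bottom layers; the use of the transformers library, SGD, and the hyper-parameter values are our assumptions, as the paper does not name an implementation.

import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-multilingual-cased")
base_lr, xi = 1e-3, 0.95
n_layers = len(model.encoder.layer)

param_groups = []
for l, layer in enumerate(model.encoder.layer):
    # eta^l = xi^(L-1-l) * base_lr, so eta^(l-1) = xi * eta^l as in Eq. 5:
    # lower layers receive smaller updates than higher layers.
    lr = base_lr * (xi ** (n_layers - 1 - l))
    param_groups.append({"params": layer.parameters(), "lr": lr})

# Only the encoder layers are shown for brevity; embeddings and task heads
# would be added as further parameter groups in practice.
optimizer = torch.optim.SGD(param_groups, lr=base_lr)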

5.3. Joint Decoder

The decoder of our framework performs two tasks: intent classification and slot-filling (sequence labeling). We employ a deep bi-directional long short-term memory (BiLSTM) network with a CRF layer, as shown in Eq. 6, to predict the classes of the input words (Chen et al., 2017; Wang et al., 2017; Chen et al., 2016; Dozat and Manning, 2016).

(6) $h_{i}=[f_{l}(\overleftarrow{h_{i+1}},x_{i}),\,f_{r}(\overrightarrow{h_{i-1}},x_{i})]$

where $f_{l}$ computes the backward hidden state and $f_{r}$ the forward hidden state of the BiLSTM, respectively. A CRF layer is then appended to decode the slot classes and produce the final outputs of the framework.
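A hedged sketch of such a joint decoder is shown below: a BiLSTM over the refined embeddings feeds a CRF for slot tags (Eq. 6), while a pooled state predicts the intent. The third-party pytorch-crf package and all dimensions are illustrative choices, not the paper's exact implementation.

import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class JointDecoder(nn.Module):
    def __init__(self, emb_dim=768, hidden=256, n_intents=10, n_slots=20):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.slot_proj = nn.Linear(2 * hidden, n_slots)
        self.intent_proj = nn.Linear(2 * hidden, n_intents)
        self.crf = CRF(n_slots, batch_first=True)

    def forward(self, emb, slot_tags=None):
        h, _ = self.lstm(emb)                    # Eq. 6: [f_l; f_r] per token
        slot_emissions = self.slot_proj(h)
        intent_logits = self.intent_proj(h.mean(dim=1))  # pooled utterance state
        if slot_tags is not None:                # training: negative CRF log-likelihood
            return intent_logits, -self.crf(slot_emissions, slot_tags)
        return intent_logits, self.crf.decode(slot_emissions)  # inference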

6. Experiments

6.1. Dataset and Evaluation

We take MultiWOZ (Budzianowski et al., 2018) as the English dataset and our collected ID-WOZ as the target-language Indonesian dataset. As the hospital and police domains in MultiWOZ contain very few dialogues (5% of the total) and only appear in the training set, we ignore them in our experiments, following (Wu et al., 2019). The train domain is absent from the Indonesian data, reflecting the cultural difference between the English-speaking and Indonesian settings. Therefore, we adopt the four domains shared by MultiWOZ and ID-WOZ as the main experiments: restaurant, hotel, taxi, and attraction. To match the test set, we merge the annotations of the English data with the Indonesian dataset, thereby abandoning a few label types, such as reference and choice in MultiWOZ. After processing, the statistics of the four domains in the two datasets are reported in Table 2, and a contrastive study of the differences between our dataset and similar well-known datasets is shown in Table 3. All experiments are evaluated on the same test set from ID-WOZ (1K dialogues, 250 per domain), which suits the local cultural background. We use the F1 score, calculated from precision and recall, as the evaluation metric.
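Concretely, with precision $P$ and recall $R$, the metric is the standard harmonic mean $F_{1}=\frac{2\cdot P\cdot R}{P+R}$.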

Dataset  | Domain     | # Sentences | # Slots | # Intents
MultiWOZ | Restaurant | 62,703      | 28,351  | 41,177
MultiWOZ | Hotel      | 64,284      | 25,985  | 42,434
MultiWOZ | Taxi       | 48,080      | 7,160   | 28,976
MultiWOZ | Attraction | 55,186      | 21,004  | 34,053
ID-WOZ   | Restaurant | 28,095      | 5,809   | 22,312
ID-WOZ   | Hotel      | 30,865      | 8,720   | 24,694
ID-WOZ   | Taxi       | 28,178      | 6,038   | 22,168
ID-WOZ   | Attraction | 36,523      | 9,198   | 29,513
Table 2. Total-count statistics for the four shared domains.
Dataset           | Twitter      | Ubuntu  | Sina Weibo   | WOZ 2.0      | Frames       | M2M          | MultiWOZ  | ID-WOZ
Domains           | Unrestricted | Ubuntu  | Unrestricted | Unrestricted | Unrestricted | Unrestricted | 7         | 9
Language          | English      | English | Chinese      | English      | English      | English      | English   | Indonesian (+En)
Total # dialogues | 1.3M         | 930K    | 4.5M         | 600          | 1,369        | 1,500        | 8,438     | 9,189 (+1k)
Total # tokens    | -            | -       | -            | 50,264       | 251,867      | 121,977      | 1,520,970 | 1,551,591
Avg. # turns      | 2.10         | 7.71    | 2.3          | 7.45         | 14.60        | 9.86         | 13.68     | 13.67
Avg. # slots      | -            | -       | -            | 4            | 61           | 14           | 25        | 8.8
Table 3. Comparison of our dataset with similar well-known datasets.

6.2. Model Settings

There are three branches of methods for utilizing the English dataset and pre-trained models: machine-translation based (MT); multilingual pre-trained embedding models with the English corpus (MLEn); and our proposed BiCF.

1) MT. We adopt the machine translation preprocessing method and extract word embeddings ($\mathbb{R}^{1\times 768}$) with random initialization, the pre-trained multilingual-BERT (ML-BERT), and ID-BERT. We also include Indonesian-fastText (ID-fastText) (Joulin et al., 2016), the Transformer (Vaswani et al., 2017), and our pre-trained Indonesian-Word2vec (ID-Word2vec), whose embeddings lie in $\mathbb{R}^{1\times 300}$, in the comparison.

2) MLEn. We adopt three pre-trained multilingual word embedding models as baselines: multilingual fastText (ML-fastText) (Joulin et al., 2016), multilingual Word2vec (ML-Word2vec) (de Melo, 2017), and multilingual-BERT (ML-BERT) (Devlin et al., 2018). Extracting the embeddings of MultiWOZ and ID-WOZ, we encode each sentence into $\mathbb{R}^{1\times 300}$, $\mathbb{R}^{1\times 300}$, and $\mathbb{R}^{1\times 768}$ dimensions, respectively.

3) BiCF. We generate about 1.5K confident word pairs from MultiWOZ and 1K translated parallel sentences. For BiCF, the training process converges after 20 epochs, reaching intent classification / slot-filling F1 scores of 91.13/87.84, 90.17/82.09, 93.37/82.98, and 89.55/85.54 on the MultiWOZ validation set for the restaurant, hotel, taxi, and attraction domains, respectively. The Indonesian training data of ID-WOZ is then fed in to refine the Indonesian embedding model.

6.3. Development Experiments

We feed 16K Indonesian sentences of ID-WOZ to each method and validate their performance on the same ID-WOZ test set. In our implementation, five-fold cross-validation is employed to find the optimal parameter setting within the training datasets (learning rate $=e^{-3}$, batch size $=64$, dropout rate $=0.1$, optimizer: SGD). To verify the stability of the proposed method, we run each experiment five times per parameter setting and compare the mean performance, reported in Table 4.
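A minimal sketch of this protocol with the stated settings is given below; load_id_woz_sentences, train, and evaluate are hypothetical placeholders for the dataset loader and the training/evaluation routines.

import numpy as np
from sklearn.model_selection import KFold

sentences = np.array(load_id_woz_sentences())   # hypothetical loader
scores = []
for train_idx, dev_idx in KFold(n_splits=5, shuffle=True).split(sentences):
    # Hyper-parameters as stated in the paper: lr = e^-3, batch 64, dropout 0.1, SGD.
    model = train(sentences[train_idx], lr=np.exp(-3),
                  batch_size=64, dropout=0.1, optimizer="sgd")
    scores.append(evaluate(model, sentences[dev_idx]))
print(f"mean F1 over folds: {np.mean(scores):.2f}")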

Method | Emb.        | Restaurant     | Hotel          | Taxi           | Attraction
       |             | Intent | Slots | Intent | Slots | Intent | Slots | Intent | Slots
MT     | Random Init | 85.48  | 74.36 | 82.73  | 73.49 | 89.15  | 80.22 | 89.64  | 86.26
MT     | ID-fastText | 86.03  | 75.27 | 83.17  | 74.03 | 89.82  | 80.28 | 90.02  | 86.88
MT     | ID-Word2vec | 88.22  | 76.70 | 86.33  | 74.11 | 89.91  | 81.81 | 91.55  | 86.90
MT     | Transformer | 90.13  | 79.91 | 91.89  | 74.27 | 90.25  | 82.11 | 92.85  | 87.16
MT     | ML-BERT     | 91.63  | 79.22 | 92.52  | 73.83 | 91.20  | 82.34 | 93.77  | 87.31
MT     | ID-BERT     | 92.37  | 81.88 | 93.78  | 75.79 | 91.76  | 83.59 | 94.07  | 89.63
MLEn   | ML-fastText | 86.00  | 76.11 | 83.10  | 74.91 | 89.22  | 80.88 | 90.31  | 86.93
MLEn   | ML-Word2vec | 88.22  | 77.70 | 86.33  | 74.11 | 89.91  | 81.81 | 91.55  | 86.90
MLEn   | ML-BERT     | 90.42  | 79.79 | 92.01  | 74.28 | 90.47  | 82.91 | 93.18  | 87.77
BiCF   | ML-fastText | 86.21  | 76.16 | 83.31  | 75.01 | 90.24  | 82.58 | 90.84  | 87.23
BiCF   | ID-fastText | 87.08  | 76.34 | 84.21  | 75.79 | 90.83  | 82.92 | 91.52  | 87.67
BiCF   | ML-Word2vec | 88.80  | 77.91 | 87.12  | 74.24 | 90.01  | 82.87 | 91.58  | 87.03
BiCF   | ID-Word2vec | 88.92  | 78.84 | 88.52  | 74.35 | 90.31  | 83.15 | 91.82  | 87.49
BiCF   | ML-BERT     | 92.92  | 82.84 | 94.30  | 76.95 | 92.23  | 90.45 | 94.80  | 90.44
BiCF   | ID-BERT     | 93.02  | 82.91 | 94.73  | 77.15 | 92.73  | 91.03 | 94.88  | 90.74
Table 4. Experimental comparison on the ID-WOZ dataset (F1 scores; “†” denotes significance testing, $p$-value $<0.05$).

We also conduct a series of experiments feeding batches of annotated Indonesian data (1K, 2K, 4K, ..., full-scale sentences). We present the results for the restaurant, hotel, taxi, and attraction domains in Fig. 3 and Fig. 4, as they are widely usable domains with the largest scale of dialogue data and annotations in both MultiWOZ and ID-WOZ. The results on the entire annotated dataset and the code are reported in detail in Table 5 and Code 1. We also conduct a comparison between Multilingual-BERT (ML-BERT) and ID-BERT on all domains of the full-scale ID-WOZ, as reported in Table 6.

Method         | ID-WOZ data  | Restaurant     | Hotel          | Taxi           | Attraction
               |              | Intent | Slots | Intent | Slots | Intent | Slots | Intent | Slots
MT (ID-BERT)   | ID-WOZ-1000  | 87.33  | 56.67 | 90.83  | 60.14 | 86.66  | 40.28 | 90.01  | 62.29
               | ID-WOZ-2000  | 88.97  | 59.74 | 91.67  | 66.63 | 86.98  | 59.88 | 91.02  | 76.05
               | ID-WOZ-4000  | 90.01  | 70.67 | 93.23  | 69.35 | 89.50  | 74.09 | 93.05  | 83.73
               | ID-WOZ-8000  | 91.67  | 80.57 | 93.65  | 73.75 | 90.63  | 82.09 | 93.95  | 87.96
               | ID-WOZ-16000 | 92.37  | 81.88 | 93.78  | 75.79 | 91.76  | 83.59 | 94.07  | 89.63
               | ID-WOZ-All   | 92.25  | 81.87 | 93.42  | 75.65 | 91.67  | 82.17 | 94.25  | 90.40
MLEn (ML-BERT) | ID-WOZ-1000  | 84.11  | 55.59 | 89.73  | 60.41 | 82.93  | 22.66 | 89.32  | 64.44
               | ID-WOZ-2000  | 86.57  | 56.86 | 91.51  | 65.26 | 86.37  | 40.63 | 91.57  | 70.56
               | ID-WOZ-4000  | 89.57  | 68.99 | 91.90  | 72.20 | 87.93  | 46.42 | 92.58  | 84.22
               | ID-WOZ-8000  | 90.93  | 73.37 | 93.42  | 75.15 | 88.08  | 58.63 | 94.03  | 86.85
               | ID-WOZ-16000 | 90.92  | 74.24 | 93.28  | 75.89 | 88.12  | 64.12 | 94.11  | 87.67
               | ID-WOZ-All   | 90.89  | 75.36 | 93.23  | 75.97 | 88.34  | 64.86 | 94.25  | 88.71
BiCF (ML-BERT) | ID-WOZ-1000  | 84.23  | 59.92 | 87.66  | 59.81 | 84.78  | 72.31 | 87.87  | 69.41
               | ID-WOZ-2000  | 86.69  | 66.67 | 90.35  | 61.93 | 86.69  | 75.25 | 90.04  | 80.05
               | ID-WOZ-4000  | 89.07  | 76.10 | 91.77  | 68.85 | 88.82  | 81.87 | 92.72  | 85.52
               | ID-WOZ-8000  | 92.23  | 78.34 | 93.13  | 73.71 | 91.55  | 86.48 | 93.46  | 88.41
               | ID-WOZ-16000 | 92.92  | 82.84 | 94.30  | 76.95 | 92.23  | 90.45 | 94.80  | 90.44
               | ID-WOZ-All   | 92.60  | 82.67 | 94.24  | 76.91 | 92.25  | 89.43 | 94.77  | 90.45
ID-BERT        | ID-WOZ-All   | 92.22  | 82.14 | 93.91  | 76.88 | 91.97  | 88.13 | 93.96  | 90.20
Table 5. Performance comparison of different methods on the selected MultiWOZ and ID-WOZ domains with different amounts of fed ID-WOZ data.
Domain     | ML-BERT        | ID-BERT
           | Intent | Slots | Intent | Slots
Restaurant | 91.07  | 77.68 | 92.22  | 82.14
Hotel      | 92.78  | 74.91 | 93.91  | 76.88
Taxi       | 90.84  | 82.91 | 91.97  | 88.13
Attraction | 93.25  | 88.04 | 93.96  | 90.20
Plane      | 91.36  | 92.77 | 93.42  | 93.11
Police     | 90.02  | 88.89 | 92.78  | 90.07
Movie      | 90.57  | 86.14 | 91.76  | 87.98
Hospital   | 92.64  | 84.15 | 93.85  | 86.09
Wear       | 90.77  | 87.02 | 91.80  | 88.34
Table 6. Experimental comparison of ML-BERT and ID-BERT on the full-scale ID-WOZ.

6.4. Results Analysis

The results of the methods from Section 6.2 are shown in Table 5, using the English data of MultiWOZ and 16K Indonesian sentences of ID-WOZ. The machine-translation-based methods (MT + ML-BERT / ID-BERT) surpass the multilingual model with English data (MLEn + ML-BERT) on intent classification, outperforming it by about 1.21%/1.95%, 0.51%/1.77%, 0.73%/1.29%, and 0.59%/0.89% F1 on restaurant, hotel, taxi, and attraction, respectively. The main reason is that the machine translation methods enjoy many more Indonesian sentences with corresponding intent labels. On the slot-filling task, however, the machine translation methods are weaker, because they suffer from invalid or mismatched labels after translation. Overall, our proposed framework (BiCF + ML-BERT / ID-BERT) performs better than the others on both tasks, as it effectively utilizes both the English intent labels and the correct slot-filling annotations. From Table 6, we can see that ID-BERT outperforms ML-BERT across all domains, demonstrating that the Indonesian-specific word embedding model (ID-BERT) captures more information and semantic knowledge than the general multilingual model (ML-BERT) in all domains.

Refer to caption
Figure 3. The comparison of different methods on four domains.
Refer to caption
Figure 4. The comparison of different methods on four domains.

6.5. Effectiveness of Using ID-WOZ

The line charts are shown in Fig. 3, where the top four sub-graphs show intent classification, the bottom four show slot-filling, and the red line marks the performance of the ID-BERT baseline. The detailed results and the line charts for the remaining domains are in Fig. 4.

1). MT methods rely heavily on translation quality. We run the BLEU (Papineni et al., 2002) test on the entire MultiWOZ; the translation scores 28.46 (BLEU-5) on 30K sentences. During dialogue translation, however, a single incorrect word can cause misunderstanding. Several examples are shown in Fig. 7: different sentences in English may be translated from the same source sentence in Indonesian. In the first case, the true meaning is requesting “how much”, but the model may misread the customer's intent as requesting the type of plane ticket. In the second, the customer is asking “how” to order a ticket, but the translator yields a “request location” intent. Based on Fig. 3 and Fig. 4, when the scale of ID-WOZ is negligible, machine translation has a large advantage on intent classification but performs badly on slot-filling. The reason is that the MT method adjusts or resets the grammar and syntactic structure toward the target language, which leaves the English slot labels dislocated, invalid, or wrong.

2). MLEn methods initially learn semantic information only from English data, which yields lower intent classification accuracy than the other methods. When this model is fed with ID-WOZ, it retains a weakness inherited from the English data: the large-scale English corpus drowns out the fed ID-WOZ data. This method is strong on slot-filling when only a small amount of ID-WOZ is used, because the slot-filling labels in the English data are accurate and complete. But performance does not improve as more ID-WOZ data is used, showing that ML-BERT is limited in reaching higher performance. Overall, this method is not recommended for building stable low-resource language dialogue understanding models, even with gold annotated data.

Refer to caption
Figure 5. An example of editing template interface.
Refer to caption
Figure 6. An example of the annotation procedure. For domain/action/intent classification, the annotator clicks among multiple labels, predefined for each domain. For slot-filling, our platform provides a convenient approach: click and underline the content, then select its slot type from a menu that pops up automatically when any words are selected.
Refer to caption
Figure 7. Illustration of mistakes from machine translation. The green sentence shows the true meaning and the red one the machine translation result. These two examples show that a tiny mistake during translation may cause complete misunderstanding.
Refer to caption
Figure 8. Illustration of annotations becoming invalid during machine translation.

3). BiCF does not outperform machine translation on intent classification when the amount of fed Indonesian data is negligible. As the ID-WOZ data grows, the strength of BiCF becomes more obvious: it begins to significantly outperform the other methods. It avoids the misunderstandings caused by translation and mitigates the shrinking effect of the English corpus, achieving the best performance, even better than the ID-BERT baseline, when the ID-WOZ data reaches around 16K, with intent classification F1 scores of 92.92%, 94.30%, 92.23%, and 94.80% for the restaurant, hotel, taxi, and attraction domains, respectively. This method also outperforms the others on slot-filling when the fed ID-WOZ data is negligible: not only does it make use of correct slot-filling annotations from the English dataset, but it also reduces the adverse effects of the large-scale English corpus. The slot-filling F1 scores reach 82.84%, 76.95%, 90.45%, and 90.44% for restaurant, hotel, taxi, and attraction, respectively. Fig. 4 reports the results of the three methods trained with 16K ID-WOZ sentences; the cross-lingual method performs better than the others when slots need more words to describe.

7. Conclusion and Future Work

We empirically investigated how to build low-resource language dialogue understanding models from scratch with an English dataset. Directly translating from English to Indonesian, or simply utilizing a multilingual pre-trained model, does not perform well. Instead, our framework BiCF leverages the rich and accurately annotated English dataset, performs effectively, and obtains reliable results. We further release a large Indonesian dialogue dataset and an ID-BERT model for future research.

8. Acknowledgments

This research is supported by the Natural Science Foundation of Heilongjiang Province (YQ2021F006).

References

  • Ammar et al. (2016) Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah A Smith. 2016. Many languages, one parser. Transactions of the Association for Computational Linguistics 4 (2016), 431–444.
  • Artetxe et al. (2017) Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 451–462.
  • Asri et al. (2017) Layla El Asri, Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul Mehrotra, and Kaheer Suleman. 2017. Frames: A corpus for adding memory to goal-oriented dialogue systems. arXiv preprint arXiv:1704.00057 (2017).
  • Budzianowski et al. (2018) Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. Multiwoz-a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278 (2018).
  • Chen et al. (2016) Hongshen Chen, Yue Zhang, and Qun Liu. 2016. Neural network for heterogeneous annotations. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 731–741.
  • Chen et al. (2017) Tao Chen, Ruifeng Xu, Yulan He, and Xuan Wang. 2017. Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN. Expert Systems with Applications 72 (2017), 221–230.
  • Cheng (2019) Yong Cheng. 2019. Semi-supervised learning for neural machine translation. In Joint Training for Neural Machine Translation. Springer, 25–40.
  • Chowanda and Chowanda (2017) Andry Chowanda and Alan Darmasaputra Chowanda. 2017. Recurrent neural network to deep learn conversation in indonesian. Procedia computer science 116 (2017), 579–586.
  • Collins (2011) Michael Collins. 2011. Statistical machine translation: IBM models 1 and 2. Columbia University (2011).
  • de Melo (2017) Gerard de Melo. 2017. Multilingual vector representations of words, sentences, and documents. In Proceedings of the IJCNLP 2017, Tutorial Abstracts. 3–5.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Dozat and Manning (2016) Timothy Dozat and Christopher D Manning. 2016. Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734 (2016).
  • Duong et al. (2015) Long Duong, Trevor Cohn, Steven Bird, and Paul Cook. 2015. A neural network model for low-resource universal dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 339–348.
  • Dyer et al. (2013) Chris Dyer, Victor Chahuneau, and Noah A Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of NAACL-HLT 2013.
  • Eric and Manning (2017) Mihail Eric and Christopher D Manning. 2017. Key-value retrieval networks for task-oriented dialogue. arXiv preprint arXiv:1705.05414 (2017).
  • Fleiss et al. (1969) Joseph L Fleiss, Jacob Cohen, and Brian S Everitt. 1969. Large sample standard errors of kappa and weighted kappa. Psychological bulletin 72, 5 (1969), 323.
  • Grave et al. (2018) Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning Word Vectors for 157 Languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
  • Guo et al. (2015) Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2015. Cross-lingual dependency parsing based on distributed representations. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: Long papers). 1234–1244.
  • Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651 (2016).
  • Kelley (1984) John F Kelley. 1984. An iterative design methodology for user-friendly natural language office information applications. ACM Transactions on Information Systems (TOIS) 2, 1 (1984), 26–41.
  • Koto (2016) Fajri Koto. 2016. A publicly available indonesian corpora for automatic abstractive and extractive chat summarization. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). 801–805.
  • Liu et al. (2018) Hui Liu, Qingyu Yin, and William Yang Wang. 2018. Towards Explainable NLP: A Generative Explanation Framework for Text Classification. arXiv preprint arXiv:1811.00196 (2018).
  • Lowe et al. (2015) Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909 (2015).
  • McCann et al. (2017) Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems. 6294–6305.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 311–318.
  • Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018).
  • Pires et al. (2019) Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is Multilingual BERT? arXiv preprint arXiv:1906.01502 (2019).
  • Ramos et al. (2003) Juan Ramos et al. 2003. Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning, Vol. 242. Piscataway, NJ, 133–142.
  • Salton et al. (1982) Gerard Salton, Edward A Fox, and Harry Wu. 1982. Extended Boolean information retrieval. Technical Report. Cornell University.
  • Schuster et al. (2018) Sebastian Schuster, Sonal Gupta, Rushin Shah, and Mike Lewis. 2018. Cross-lingual Transfer Learning for Multilingual Task Oriented Dialog. arXiv preprint arXiv:1810.13327 (2018).
  • Schuster et al. (2019) Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019. Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing. arXiv preprint arXiv:1902.09492 (2019).
  • Shah et al. (2018) Pararth Shah, Dilek Hakkani-Tür, Gokhan Tür, Abhinav Rastogi, Ankur Bapna, Neha Nayak, and Larry Heck. 2018. Building a conversational agent overnight with dialogue self-play. arXiv preprint arXiv:1801.04871 (2018).
  • Tho et al. (2018) Cuk Tho, Arden S Setiawan, and Andry Chowanda. 2018. Forming of Dyadic Conversation Dataset for Bahasa Indonesia. Procedia Computer Science 135 (2018), 315–322.
  • Tiedemann (2015) Jörg Tiedemann. 2015. Improving the cross-lingual projection of syntactic dependencies. In Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania. Linköping University Electronic Press, 191–199.
  • Tiedemann and Agić (2016) Jörg Tiedemann and Zeljko Agić. 2016. Synthetic treebanking for cross-lingual dependency parsing. Journal of Artificial Intelligence Research 55 (2016), 209–248.
  • Vaswani et al. (2018) Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, et al. 2018. Tensor2tensor for neural machine translation. arXiv preprint arXiv:1803.07416 (2018).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
  • Wang et al. (2017) Hongmin Wang, Yue Zhang, GuangYong Leonard Chan, Jie Yang, and Hai Leong Chieu. 2017. Universal dependencies parsing for colloquial singaporean english. arXiv preprint arXiv:1705.06463 (2017).
  • Wen et al. (2016) Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2016. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562 (2016).
  • Williams et al. (2013) Jason Williams, Antoine Raux, Deepak Ramachandran, and Alan Black. 2013. The dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference. 404–413.
  • Wu et al. (2019) Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019. Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems. arXiv preprint arXiv:1905.08743 (2019).
  • Wu et al. (2016) Yu Wu, Wei Wu, Chen Xing, Ming Zhou, and Zhoujun Li. 2016. Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots. arXiv preprint arXiv:1612.01627 (2016).
  • Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237 (2019).
  • Zhang et al. (2019) Meishan Zhang, Yue Zhang, and Guohong Fu. 2019. Cross-Lingual Dependency Parsing Using Code-Mixed TreeBank. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 996–1005.