
Building Dialogue Understanding Models for Low-resource Language Indonesian from Scratch

Donglin Di [email protected] Advance.AI Weinan Zhang [email protected] Harbin Institute of Technology Yue Zhang [email protected] Westlake University  and  Fanglin Wang [email protected] Advance.AI
(2023)
Abstract.

Making use of off-the-shelf resources from resource-rich languages to transfer knowledge to low-resource languages has attracted much attention recently. However, the requirements for enabling a model to reach reliable performance, such as the scale of annotated data required or an effective framework, remain poorly characterized. To investigate the first question, we empirically study the cost-effectiveness of several methods for training intent classification and slot-filling models for Indonesian (ID) from scratch by utilizing English data. Confronting the second challenge, we propose a Bi-Confidence-Frequency Cross-Lingual transfer framework (BiCF), composed of “BiCF Mixing”, “Latent Space Refinement”, and “Joint Decoder”, to tackle the obstacle of lacking low-resource language dialogue data. Extensive experiments demonstrate that our framework performs reliably and cost-efficiently on different scales of manually annotated Indonesian data. We release a large-scale fine-labeled dialogue dataset (ID-WOZ) and ID-BERT of Indonesian for further research.

dialogue datasets, intent classification, slot-filling, Indonesian
copyright: acmcopyrightjournalyear: 2023doi: 10.1145/3575803ccs: Computing methodologies Neural networksccs: Computing methodologies Discourse, dialogue and pragmatics

1. Introduction

It is generally accepted that neural dialogue understanding models rely heavily on large-scale training data (Liu et al., 2018). Existing work has been conducted mostly on rich-resource languages such as English (Wen et al., 2016; Lowe et al., 2015) and Chinese (Wu et al., 2016), but thousands of low-resource languages around the world lack training data. It is impractical and cost-ineffective to collect and annotate large-scale datasets for low-resource languages (Grave et al., 2018) to train dialogue understanding models. Therefore, as shown in Fig. 1, it remains a huge challenge to efficiently adapt existing research resources and findings to low-resource languages (e.g., Indonesian (ID)), so that the need for understanding multilingual task-oriented dialogue (Schuster et al., 2018; Schuster et al., 2019) can be addressed effectively.

Refer to caption
Figure 1. The investigation of utilizing off-the-shelf resources and models for generating low-resource language dialogue understanding models from scratch.

One challenge is how to make effective use of off-the-shelf resources of rich-resource languages (e.g., English (Budzianowski et al., 2018)). An intuitive method is to use a neural machine translation system (Vaswani et al., 2018; Cheng, 2019) to translate the English dataset into Indonesian, and then train the dialogue understanding models on the translated data. Another strategy is to utilize multilingual word embeddings (Devlin et al., 2018; Pires et al., 2019), which allow a dialogue model trained on the English dataset to be applied directly to Indonesian, since the pre-trained multilingual model contains vocabulary for both languages. Each of these methods has its own strengths and limitations. The former allows us to use a language-specific pre-trained model (Peters et al., 2018; Devlin et al., 2018; Yang et al., 2019) for better representation, obtained by training on large-scale unlabelled general documents or web pages. However, methods of this branch suffer from machine translation errors and from dislocated, invalid annotations carried over from the source corpus, which can significantly hurt the subsequent dialogue modeling (i.e., slot-filling). The latter suffers from intrinsic differences between English and Indonesian, including variations in syntactic and semantic patterns.

The second challenge lies in how to transfer existing models to the target low-resource language. Cross-lingual transfer learning (Schuster et al., 2018) can potentially alleviate the limitations of both methods. One possible approach is to align the contextual word embeddings or sentence-level encodings in the semantic latent space (Schuster et al., 2019), which avoids semantic misunderstandings and syntactic mistakes. However, this method can be impacted by imperfect alignments, and its implementation is complex, which slows application and deployment.

In this work, we propose a Bi-confidence-frequency Cross-lingual Transfer framework (BiCF) to overcome the challenges illustrated above. For the first challenge, we adopt the word-level alignment strategy (Zhang et al., 2019), which has been demonstrated to be as effective as phrase-level alignment yet much simpler and more stable (Tiedemann, 2015; Tiedemann and Agić, 2016). Specifically, the first stage of BiCF is Bi-confidence-frequency Mixing, which utilizes the English dataset to generate code-mixed data, avoiding sentence-level translation errors as well as label dislocation. The mixed data not only takes importance frequency and translation confidence into consideration but also carries gold annotations for Indonesian from the English datasets. For the second challenge, Latent Space Refinement and Joint Decoder are designed on top of the resulting high-quality mixed data, utilizing and refining pre-trained off-the-shelf word embedding models, to eventually train the dialogue understanding models (i.e., intent classification and slot-filling) for Indonesian.

To conduct extensive experiments for Indonesian, we follow the methodology of MultiWOZ (Budzianowski et al., 2018), a large-scale task-oriented English dialogue dataset, to collect and annotate a counterpart dataset with richer domains in Indonesian (ID), named ID-WOZ. Extensive experiments show that the proposed framework achieves comparable or better performance with less gold annotated data. We quantify the influence of each factor in our experiments.

The main contributions of this paper are summarized as follows:

  • We propose a framework (BiCF) that utilizes an English dataset to train Indonesian dialogue understanding models and achieves good performance on the Indonesian dialogue dataset.

  • We release a large-scale manually annotated multi-domain ID-WOZ dialogue dataset, together with a pre-trained ID-BERT model as the resource contributions for low-resource language dialogue understanding tasks.

  • We investigate how much annotated data well-performing dialogue understanding models demand, which may guide related research on collecting datasets or training models for other low-resource languages.

2. Related Work

Low-resource Language. Existing works (Guo et al., 2015; Duong et al., 2015; Ammar et al., 2016; Wang et al., 2017) study multilingual parsing on low-resource languages, which is helpful for improving low-resource language understanding. Wang et al. (2017) propose to integrate English syntactic knowledge into a parser trained on the Singlish treebank, showing that it is reasonable to leverage English to improve low-resource language models.
Cross-lingual Transfer. Artetxe et al. (2017) propose a self-learning framework and a small word dictionary to learn a mapping between source and target word embeddings. Zhang et al. (2019) focus on improving dependency parsing by mixing confident target words into the source treebank. Schuster et al. (2019) utilize Multilingual CoVe embeddings obtained from machine translation systems (McCann et al., 2017) in Thai and Spanish for zero-shot dependency parsing. Relying on aligned parallel sentence pairs suffers from noise and imperfect alignments (Zhang et al., 2019). In line with these methods, encoding the semantic information directly within the same cross-lingual latent space can avoid the semantic misunderstandings that arise from machine translation or wrong alignments.

3. ID-WOZ

There has been a lack of available datasets for training natural dialogue understanding systems in regional low-resource languages such as Indonesian (Chowanda and Chowanda, 2017; Koto, 2016; Tho et al., 2018). As a result, we build our ID-WOZ dataset from scratch; detailed statistics are reported in the Experiments section. In particular, we organize a structured annotation scheme with structured semantic labels, drawing on the experience of previous work (Williams et al., 2013; Asri et al., 2017; Eric and Manning, 2017; Shah et al., 2018; Budzianowski et al., 2018).

3.1. Coverage

ID-WOZ is constructed with the goal of obtaining highly natural conversations between a customer and an agent or a query information center, focusing on daily life. We consider various possible dialogue scenarios, ranging from basic requests such as hotel and restaurant to a few urgent situations such as hospital or police. Our dataset consists of nine domains, namely plane, taxi, wear, restaurant, movie, hotel, attraction, hospital, and police, most of which are extended domains including the sub-task Booking (with the exception of police).

3.2. Collection and Annotation

We adopt the Wizard-of-Oz (Kelley, 1984) dialogue-collection approach, which has been shown to be effective for obtaining a high-quality corpus at relatively low cost and with modest time effort. Following the successful experience of MultiWOZ (Budzianowski et al., 2018), we create a large-scale corpus of natural human-human conversations on a similar scale. Based on the given templates for several domains, the users and wizards generate conversations using heuristic-based rules to prevent information overflow. We design and develop a collection-annotation pipeline platform with a user-friendly interface for building up the dataset, shown in the Experiments section.

To accelerate and optimize the collection and annotation process, we design and develop a pipeline platform. It consists of three stages, “collection - annotation - statistics & analysis”, which are executed synchronously after the initialization process. We split a number of well-trained annotators (80 local people: 70 native ID speakers and 10 bilingual citizens, plus 2 main organizers) into two groups to produce dialogues and annotations. A quarter of the annotators (20) are trained to play the wizard role following the templates we provide. After about 1K dialogues have been collected initially (about one week), and while conversation collection is still ongoing, the second group of annotators (62) joins in to produce the detailed fully labeled corpus, including domains, actions, intents, and slots.

Quality is assured through three processes, namely “script checking”, “cross-checking”, and “supervisor checking”. In particular, the scripts filter out cases with potential faults such as missing or malformed labels. In the cross-checking process, the annotators are assigned not only fresh unlabeled annotation tasks but also a sample of labeled cases (20%) from their peers, in a double-blind process. Cases passing the cross-checking procedure are sampled and handed over to the supervisors (the two organizers), who are more familiar with the details of the entire annotation task and control the overall consistency and accuracy of the annotation. We further adopt the inter-annotator agreement (IAA) (Fleiss et al., 1969) to measure how well our recruited annotators make the same annotation decision for a given category, as follows:

(1) $\kappa \equiv \frac{p_{o}-p_{e}}{1-p_{e}} = 1-\frac{1-p_{o}}{1-p_{e}}$

where $p_{o}$ and $p_{e}$ denote the relative observed agreement among raters and the hypothetical probability of chance agreement, respectively. The average score on our dataset is 0.834.
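For illustration, the following is a minimal Python sketch of the multi-rater (Fleiss) form of the agreement score in Eq. 1; the per-item category counts in the example are invented.

import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """counts[i, j] = number of raters assigning item i to category j."""
    n_raters = counts.sum(axis=1)[0]          # raters per item (assumed constant)
    p_j = counts.sum(axis=0) / counts.sum()   # marginal category proportions
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_o, p_e = p_i.mean(), (p_j ** 2).sum()   # observed / chance agreement
    return (p_o - p_e) / (1 - p_e)            # Eq. 1

# Toy example: three annotators labeling four utterances with one of three intents.
counts = np.array([[3, 0, 0], [2, 1, 0], [0, 3, 0], [1, 1, 1]])
print(round(fleiss_kappa(counts), 3))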

3.3. Statistics and Analysis

Table 1 compares our dataset with existing Indonesian datasets and the English dialogue dataset MultiWOZ (Budzianowski et al., 2018). ID Chat (Koto, 2016) is the first publicly available Indonesian chat corpus and has drawn some related research on Indonesian-language dialogue (Chowanda and Chowanda, 2017). Dyadic Chat (Tho et al., 2018) is another public Indonesian chat corpus, focusing on dyadic conversation; dyadic describes the relationship between two people, for instance the romantic relationship of a couple. Compared with these small-scale datasets, ID-WOZ is the first large-scale corpus (about ten thousand dialogues across multiple domains) focusing on general task-oriented chat.

MultiWOZ (Budzianowski et al., 2018) is a large-scale multi-domain task-oriented English dialogue dataset covering seven distinct domains (taxi, restaurant, hotel, attraction, hospital, police, train), with fine-labeled actions and slots for the spoken language understanding stage. Considering the regional cultural background, our collected dataset contains a few more general domains (i.e., wear, movie, plane) and more corresponding slot types, such as clothes type, movie genre, and movie synopsis.

Dataset        | ID Chat | Dyadic Chat | MultiWOZ  | ID-WOZ
Domains        | None    | None        | 7         | 9
Language       | ID      | ID          | En        | ID
Total # dials  | 300     | 79          | 8,438     | 9,189
Total # tokens | 150,000 | 3,164       | 1,520,970 | 1,551,591
Total # utters | 1,000   | 158         | 142,974   | 251,184
Avg. # turns   | 3       | 3           | 13.68     | 13.67
Avg. # slots   | -       | -           | 25        | 8.8
Table 1. Comparison of ID-WOZ with other related datasets on several statistical metrics.

4. ID-BERT

Although there are several off-the-shelf pre-trained BERT models for rich-resource languages such as English and Chinese, pre-trained language-specific models for low-resource languages like Indonesian are, to our knowledge, still not available. We release a pre-trained model for Indonesian named ID-BERT as another resource contribution. Although most related work (Schuster et al., 2019; Pires et al., 2019) relies on the pre-trained Multilingual-BERT model and fine-tunes it for low-resource languages, our main goal is to build a relatively reliable dialogue system and to examine how to overcome the gap between a language-specific BERT and the multilingual BERT. We therefore train an ID-BERT to compare performance on spoken language understanding and further tasks. We pre-train a BERT for Indonesian from scratch on about 3.3 billion tokens of document-level corpus from Indonesian websites, covering news reports, research papers, daily articles, and other text genres. The vocabulary of our ID-BERT contains 0.9M entries, much larger than that of Multilingual-BERT (0.12M); we believe this vocabulary size is sufficient to cover most scenarios of daily multi-domain task-oriented dialogue in Indonesian. Training takes one week on a Google Cloud TPU v3-8, and our ID-BERT (Cased, L=12, H=768, A=12) is eventually obtained.

5. BiCF Cross-lingual Transfer

Refer to caption
Figure 2. Illustration of the proposed framework (BiCF), which consists of BiCF Mixing, Latent Space Refinement, and Joint Decoder. The frequency-word and confidence-word sets in the first stage are derived from the English dataset and from confidence-translated parallel sentences, respectively; the mixed data is generated by fusion and mixing. The cross-lingual space refinement module generates a target-specific embedding model to better represent Indonesian. The final stage decodes and outputs intents and slots jointly.

Here we describe our proposed pipeline framework, “BiCF Cross-lingual Transfer” (BiCF), in detail. It consists of three components, namely “BiCF Mixing”, “Latent Space Refinement”, and “Joint Decoder”. As shown in Fig. 2, the BiCF Mixing step replaces a few English words with Indonesian ones. We then train and refine the cross-lingual semantic embedding latent space on the mixed data with gold annotations from the English dataset. Finally, we adopt a combination of BiLSTM and CRF to decode the intent and slots jointly.

5.1. BiCF Mixing

The first stage of our framework is “Bi-confidence-frequency Mixing” (BiCF Mixing). As shown in Fig. 2, we use the source-language data in two steps. The first is to generate the frequency-word set ($\mathbf{W}_{freq}$) of the source data. The second is to obtain the word alignment with the translating confidence ($\lambda_{conf}$) of each word and generate the confidence-word set ($\mathbf{W}_{conf}$). The goal of this stage is to select word pairs for English and Indonesian that are both frequent and high-confidence, and to yield the mixed data $\mathcal{T}_{mix}$.

Given the set of source sentences $\mathcal{S}=\{s_{1},s_{2},...,s_{n}\}$, we calculate TF-IDF (Salton et al., 1982; Ramos et al., 2003) for each word in the source dialogue corpus, as shown in Eq. 2:

(2) $\left\{\begin{matrix}tf_{(i,j)}=\frac{\mathcal{N}(w_{i}^{s},s_{j})}{\sum_{k}\mathcal{N}(w_{k}^{s},s_{j})}\\ idf_{(i)}=\log\frac{\left|\mathcal{S}\right|}{1+\left|\{j:w_{i}^{s}\in s_{j}\}\right|}\\ tf\text{-}idf_{(i,j)}=tf_{(i,j)}\times idf_{(i)}\end{matrix}\right.$

where $\mathcal{N}(w_{i}^{s},s_{j})$ is the number of occurrences of the word $w_{i}^{s}$ in the source-language sentence $s_{j}$, and the denominator $\sum_{k}\mathcal{N}(w_{k}^{s},s_{j})$ is the sum of occurrences of all terms $w_{k}^{s}$ in the sentence $s_{j}\in\mathcal{S}$. $\left|\mathcal{S}\right|$ represents the number of sentences and $\left|\{j:w_{i}^{s}\in s_{j}\}\right|$ denotes the number of sentences containing the word $w_{i}^{s}$. The frequency-word set $\mathbf{W}_{freq}=\langle(w_{i}^{s},r_{i}),...,(w_{j}^{s},r_{j})\rangle$ is obtained by sorting the outputs $tf\text{-}idf_{(i,j)}$ of the TF-IDF algorithm, where $r_{i}$ denotes the frequency score.
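As an illustration of Eq. 2, a minimal Python sketch that scores source words and builds $\mathbf{W}_{freq}$ might look as follows; the toy sentences and the max-over-sentences aggregation of per-sentence scores are our assumptions, since the paper does not fix the aggregation.

import math
from collections import Counter

sentences = [["i", "need", "a", "cheap", "hotel"],
             ["book", "a", "table", "at", "a", "cheap", "restaurant"]]

doc_freq = Counter(w for s in sentences for w in set(s))  # |{j : w in s_j}|
n_sents = len(sentences)                                  # |S|

scores = {}
for s in sentences:
    tf = Counter(s)
    for w, c in tf.items():
        # tf-idf_(i,j) per Eq. 2, with the smoothed idf denominator.
        tfidf = (c / len(s)) * math.log(n_sents / (1 + doc_freq[w]))
        scores[w] = max(scores.get(w, float("-inf")), tfidf)

# W_freq: words sorted by their best score r_i, highest first.
W_freq = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)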

Algorithm 1 BiCF Mixing
Input: $\mathcal{S}$, $\mathbf{W}_{freq}$, $\lambda_{freq}$, $\mathbf{W}_{conf}$, $\lambda_{conf}$, $\theta$
Output: $\mathcal{T}_{mix}$
1: $\widehat{\mathbf{W}}_{freq} \leftarrow Thresh(\mathbf{W}_{freq},\lambda_{freq})$
2: $\widehat{\mathbf{W}}_{conf} \leftarrow Thresh(\mathbf{W}_{conf},\lambda_{conf})$
3: $\widetilde{\mathbf{W}}_{sub} \leftarrow Fusion(\widehat{\mathbf{W}}_{freq},\widehat{\mathbf{W}}_{conf},\theta)$
4: $\mathcal{T}_{mix} \leftarrow \Phi$
5: for $s\in\mathcal{S}$ do
6:   $\widehat{s} \leftarrow s$
7:   for $w^{s}\in s$ do
8:     if $w^{s}\in\widetilde{\mathbf{W}}_{sub}$ then
9:       $w^{t} \leftarrow Get(\widetilde{\mathbf{W}}_{sub},w^{s})$
10:      $\widehat{s} \leftarrow Mixing(\widehat{s},w^{s},w^{t})$
11:    end if
12:  end for
13:  $\mathcal{T}_{mix} \leftarrow \mathcal{T}_{mix}\cup\widehat{s}$
14: end for
15: return $\mathcal{T}_{mix}$

Then we adopt small-scale, high-quality parallel sentences (1K), translated by skilled bilingual translators, to generate word alignments using fast_align (Dyer et al., 2013). Given the English sentences and their corresponding confidently translated sentences, the fast_align model uses a log-linear reparameterization of IBM Model 2 (Collins, 2011) to generate a set of confidence-word pairs $\mathbf{W}_{conf}=\langle(w_{i}^{s},w_{i}^{t}),p_{i}\rangle,...,\langle(w_{j}^{s},w_{j}^{t}),p_{j}\rangle$, with the Indonesian word and the confidence score denoted by $w_{i}^{t}$ and $p_{i}$, respectively.
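Since fast_align reads "english ||| indonesian" bitext lines and emits Pharaoh-style links ("0-0 1-2 ...") rather than explicit per-link scores in its default output, one hedged way to construct $\mathbf{W}_{conf}$ is to estimate $p_{i}$ as the relative alignment frequency of each word pair over the 1K parallel sentences, as sketched below; the frequency-based confidence is our assumption, not necessarily the paper's exact scoring.

from collections import Counter

def build_w_conf(parallel_path: str, align_path: str):
    """Build {(english_word, indonesian_word): confidence} from fast_align output."""
    pair_counts, src_counts = Counter(), Counter()
    with open(parallel_path) as fp, open(align_path) as fa:
        for bitext, links in zip(fp, fa):
            src, tgt = [side.split() for side in bitext.split("|||")]
            for link in links.split():                  # e.g. "0-0", "1-2"
                i, j = map(int, link.split("-"))
                pair_counts[(src[i], tgt[j])] += 1
                src_counts[src[i]] += 1
    # Confidence p_i: how often this translation is chosen for the source word.
    return {pair: n / src_counts[pair[0]] for pair, n in pair_counts.items()}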

As shown in Algorithm 1, after selecting the words above both the frequency threshold $\lambda_{freq}$ and the confidence threshold $\lambda_{conf}$, we fuse the words to generate the substitute word set $\mathbf{W}_{sub}$. The $Thresh$ function in lines 1 and 2 of Algorithm 1 is defined as Eq. 3:

(3) $\widehat{\mathbf{W}}=Sort(\mathbf{W}_{(\cdot)},\mathcal{P}_{(\cdot)})\odot\lambda_{(\cdot)}$

where $\mathbf{W}_{(\cdot)}$ denotes the frequency-word set ($\mathbf{W}_{freq}$) or the confidence-word set ($\mathbf{W}_{conf}$), $\mathcal{P}_{(\cdot)}$ denotes the frequency scores $r_{i}$ or the confidence scores $p_{i}$, and $\odot$ is the top-subset selection operation. The $Fusion$ function in line 3 is implemented as Eq. 4:

(4) $\widetilde{\mathbf{W}}=(\widehat{\mathbf{W}}_{freq}\odot\theta)\cap(\widehat{\mathbf{W}}_{conf}\odot(1-\theta))$

where $\theta$ is a hyper-parameter adjusting the ratio between the two branches of word sets. Lines 4 to 14 in Algorithm 1 illustrate the mixing procedure: we incrementally substitute the source word $w^{s}$ of a temporarily copied sentence $\widehat{s}$ with the corresponding target word $w^{t}$. In this way, the mixed corpus $\mathcal{T}_{mix}$, consisting of both English and Indonesian words, is generated.
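For concreteness, a compact Python sketch of Algorithm 1 under the definitions above is given below; the threshold values, $\theta$, and the data structures (word-to-score dictionaries) are illustrative assumptions.

def thresh(scored_words, lam):
    """Eq. 3: keep entries whose score passes the threshold, best first."""
    ordered = sorted(scored_words.items(), key=lambda kv: kv[1], reverse=True)
    return {w: s for w, s in ordered if s >= lam}

def fusion(w_freq, w_conf, theta):
    """Eq. 4: intersect the top theta of frequent source words with the
    top (1 - theta) of confident (source, target) translation pairs."""
    top_freq = set(list(w_freq)[:int(theta * len(w_freq))])
    sub = {}
    for (src, tgt), _ in list(w_conf.items())[:int((1 - theta) * len(w_conf))]:
        if src in top_freq:
            sub[src] = tgt          # substitute English src with Indonesian tgt
    return sub

def bicf_mixing(sentences, w_freq, lam_f, w_conf, lam_c, theta=0.5):
    """Lines 4-14 of Algorithm 1: substitute selected words sentence by sentence."""
    w_sub = fusion(thresh(w_freq, lam_f), thresh(w_conf, lam_c), theta)
    return [[w_sub.get(w, w) for w in s] for s in sentences]

Here w_freq maps each English word to its TF-IDF score $r_{i}$ and w_conf maps each (English, Indonesian) pair to its confidence $p_{i}$, as built in the two preceding steps.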

5.2. Cross-lingual Space Refinement

We train and refine the initially pre-trained multilingual model (i.e., Multilingual-BERT) on the mixed corpus $\mathcal{T}_{mix}$ with annotations from the source English dataset. This operation updates the embeddings of both the English and the Indonesian words. This stage therefore allows our model to make use of English corpora and obtain a refined latent space with improved semantic representations. The multilingual latent space is updated with the discriminative training process of Eq. 5:

(5) $\left\{\begin{matrix}\Theta_{i+1}^{l}=\Theta_{i}^{l}-\eta^{l}\cdot\nabla_{\Theta^{l}}J(\Theta)\\ \eta^{l-1}=\xi\cdot\eta^{l}\end{matrix}\right.$

where $\eta^{l}$ denotes the learning rate of the $l$-th layer, $\Theta_{i}^{l}$ represents the parameters of the model at the $l$-th layer in step $i$, and $\nabla_{\Theta^{l}}J(\Theta)$ is the gradient of the parameters at the $l$-th layer with respect to the model's objective function, i.e., supervised by the intent classification and slot-filling annotations in our model.

When performance stabilizes on the training set (around 25 epochs in our experiments), we save the model that performs best on the validation set as the mixed refined embedding model, denoted by the blue embedding space in the middle of Fig. 2. We then feed fine-labeled Indonesian data into the mixed refined embedding model and transfer once more to obtain a refined target-specific embedding model. In this way, by utilizing the English dataset, we generate a better latent representation space for Indonesian, encoding each sentence into a feature vector in $\mathbb{R}^{1\times 768}$.
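As a minimal sketch of the layer-wise update in Eq. 5, the PyTorch snippet below assigns each encoder layer its own learning rate, decayed by $\xi$ toward the bottom layers; the use of the transformers library, SGD, and the hyper-parameter values are our assumptions, as the paper does not name an implementation.

import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-multilingual-cased")
base_lr, xi = 1e-3, 0.95
n_layers = len(model.encoder.layer)

param_groups = []
for l, layer in enumerate(model.encoder.layer):
    # eta^l = xi^(L-1-l) * base_lr, so eta^(l-1) = xi * eta^l as in Eq. 5:
    # lower layers receive smaller updates than higher layers.
    lr = base_lr * (xi ** (n_layers - 1 - l))
    param_groups.append({"params": layer.parameters(), "lr": lr})

# Only the encoder layers are shown for brevity; embeddings and task heads
# would be added as further parameter groups in practice.
optimizer = torch.optim.SGD(param_groups, lr=base_lr)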

5.3. Joint Decoder

The decoder of our framework performs two tasks: intent classification and slot-filling (sequence labeling). We employ a deep bi-directional long short-term memory (BiLSTM) network with a CRF layer, as shown in Eq. 6, to predict the classes of the input words (Chen et al., 2017; Wang et al., 2017; Chen et al., 2016; Dozat and Manning, 2016).

(6) $h_{i}=[f_{l}(\overleftarrow{h_{i+1}},x_{i}),\,f_{r}(\overrightarrow{h_{i-1}},x_{i})]$

where $f_{l}$ computes the backward hidden state and $f_{r}$ the forward hidden state of the BiLSTM, respectively. A CRF layer is then appended to decode the slot classes and produce the final outputs of the framework.
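A hedged sketch of such a joint decoder is shown below: a BiLSTM over the refined embeddings feeds a CRF for slot tags (Eq. 6), while a pooled state predicts the intent. The third-party pytorch-crf package and all dimensions are illustrative choices, not the paper's exact implementation.

import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class JointDecoder(nn.Module):
    def __init__(self, emb_dim=768, hidden=256, n_intents=10, n_slots=20):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.slot_proj = nn.Linear(2 * hidden, n_slots)
        self.intent_proj = nn.Linear(2 * hidden, n_intents)
        self.crf = CRF(n_slots, batch_first=True)

    def forward(self, emb, slot_tags=None):
        h, _ = self.lstm(emb)                    # Eq. 6: [f_l; f_r] per token
        slot_emissions = self.slot_proj(h)
        intent_logits = self.intent_proj(h.mean(dim=1))  # pooled utterance state
        if slot_tags is not None:                # training: negative CRF log-likelihood
            return intent_logits, -self.crf(slot_emissions, slot_tags)
        return intent_logits, self.crf.decode(slot_emissions)  # inference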

6. Experiments

6.1. Dataset and Evaluation

We take MultiWOZ (Budzianowski et al., 2018) as the English dataset and our collected ID-WOZ as the target-language Indonesian dataset. As the hospital and police domains in MultiWOZ contain very few dialogues (5% of the total) and only appear in the training set, we ignore them in our experiments, following (Wu et al., 2019). The train domain is absent from the Indonesian data, reflecting the cultural difference between the English-speaking and Indonesian settings. Therefore, we adopt the four domains shared by MultiWOZ and ID-WOZ as the main experiments: restaurant, hotel, taxi, and attraction. To match the test set, we merge the annotations of the English data with the Indonesian dataset, thereby abandoning a few label types, such as reference and choice in MultiWOZ. After processing, the statistics of the four domains in the two datasets are reported in Table 2, and a contrastive study of the differences between our dataset and similar well-known datasets is shown in Table 3. All experiments are evaluated on the same test set from ID-WOZ (1K dialogues, 250 per domain), which suits the local cultural background. We use the F1 score, calculated from precision and recall, as the evaluation metric.
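Concretely, with precision $P$ and recall $R$, the metric is the standard harmonic mean $F_{1}=\frac{2\cdot P\cdot R}{P+R}$.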

Dataset  | Domain     | # Sentences | # Slots | # Intents
MultiWOZ | Restaurant | 62,703      | 28,351  | 41,177
MultiWOZ | Hotel      | 64,284      | 25,985  | 42,434
MultiWOZ | Taxi       | 48,080      | 7,160   | 28,976
MultiWOZ | Attraction | 55,186      | 21,004  | 34,053
ID-WOZ   | Restaurant | 28,095      | 5,809   | 22,312
ID-WOZ   | Hotel      | 30,865      | 8,720   | 24,694
ID-WOZ   | Taxi       | 28,178      | 6,038   | 22,168
ID-WOZ   | Attraction | 36,523      | 9,198   | 29,513
Table 2. Total-count statistics for the four shared domains.
Dataset           | Twitter      | Ubuntu  | Sina Weibo   | WOZ 2.0      | Frames       | M2M          | MultiWOZ  | ID-WOZ
Domains           | Unrestricted | Ubuntu  | Unrestricted | Unrestricted | Unrestricted | Unrestricted | 7         | 9
Language          | English      | English | Chinese      | English      | English      | English      | English   | Indonesian (+En)
Total # dialogues | 1.3M         | 930K    | 4.5M         | 600          | 1,369        | 1,500        | 8,438     | 9,189 (+1k)
Total # tokens    | -            | -       | -            | 50,264       | 251,867      | 121,977      | 1,520,970 | 1,551,591
Avg. # turns      | 2.10         | 7.71    | 2.3          | 7.45         | 14.60        | 9.86         | 13.68     | 13.67
Avg. # slots      | -            | -       | -            | 4            | 61           | 14           | 25        | 8.8
Table 3. Comparison of our dataset with similar well-known datasets.

6.2. Model Settings

There are three branches of methods for utilizing the English dataset and pre-trained models: machine-translation based (MT); multilingual pre-trained embedding models with the English corpus (MLEn); and our proposed BiCF.

1) MT. We adopt the machine translation preprocessing method and extract word embeddings ($\mathbb{R}^{1\times 768}$) with random initialization, the pre-trained multilingual-BERT (ML-BERT), and ID-BERT. We also include Indonesian-fastText (ID-fastText) (Joulin et al., 2016), the Transformer (Vaswani et al., 2017), and our pre-trained Indonesian-Word2vec (ID-Word2vec), whose embeddings lie in $\mathbb{R}^{1\times 300}$, in the comparison.

2) MLEn. We adopt three pre-trained multilingual word embedding models as baselines: multilingual fastText (ML-fastText) (Joulin et al., 2016), multilingual Word2vec (ML-Word2vec) (de Melo, 2017), and multilingual-BERT (ML-BERT) (Devlin et al., 2018). Extracting the embeddings of MultiWOZ and ID-WOZ, we encode each sentence into $\mathbb{R}^{1\times 300}$, $\mathbb{R}^{1\times 300}$, and $\mathbb{R}^{1\times 768}$ dimensions, respectively.

3) BiCF. We generate about 1.5K confident word pairs from MultiWOZ and 1K translated parallel sentences. For BiCF, the training process converges after 20 epochs, reaching intent classification / slot-filling F1 scores of 91.13/87.84, 90.17/82.09, 93.37/82.98, and 89.55/85.54 on the MultiWOZ validation set for the restaurant, hotel, taxi, and attraction domains, respectively. The Indonesian training data of ID-WOZ is then fed in to refine the Indonesian embedding model.

6.3. Development Experiments

We feed 16K Indonesian sentences of ID-WOZ to each method and validate their performance on the same ID-WOZ test set. In our implementation, five-fold cross-validation is employed to find the optimal parameter setting within the training datasets (learning rate $=e^{-3}$, batch size $=64$, dropout rate $=0.1$, optimizer: SGD). To verify the stability of the proposed method, we run each experiment five times per parameter setting and compare the mean performance, reported in Table 4.
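A minimal sketch of this protocol with the stated settings is given below; load_id_woz_sentences, train, and evaluate are hypothetical placeholders for the dataset loader and the training/evaluation routines.

import numpy as np
from sklearn.model_selection import KFold

sentences = np.array(load_id_woz_sentences())   # hypothetical loader
scores = []
for train_idx, dev_idx in KFold(n_splits=5, shuffle=True).split(sentences):
    # Hyper-parameters as stated in the paper: lr = e^-3, batch 64, dropout 0.1, SGD.
    model = train(sentences[train_idx], lr=np.exp(-3),
                  batch_size=64, dropout=0.1, optimizer="sgd")
    scores.append(evaluate(model, sentences[dev_idx]))
print(f"mean F1 over folds: {np.mean(scores):.2f}")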

Method | Emb.        | Restaurant     | Hotel          | Taxi           | Attraction
       |             | Intent | Slots | Intent | Slots | Intent | Slots | Intent | Slots
MT     | Random Init | 85.48  | 74.36 | 82.73  | 73.49 | 89.15  | 80.22 | 89.64  | 86.26
MT     | ID-fastText | 86.03  | 75.27 | 83.17  | 74.03 | 89.82  | 80.28 | 90.02  | 86.88
MT     | ID-Word2vec | 88.22  | 76.70 | 86.33  | 74.11 | 89.91  | 81.81 | 91.55  | 86.90
MT     | Transformer | 90.13  | 79.91 | 91.89  | 74.27 | 90.25  | 82.11 | 92.85  | 87.16
MT     | ML-BERT     | 91.63  | 79.22 | 92.52  | 73.83 | 91.20  | 82.34 | 93.77  | 87.31
MT     | ID-BERT     | 92.37  | 81.88 | 93.78  | 75.79 | 91.76  | 83.59 | 94.07  | 89.63
MLEn   | ML-fastText | 86.00  | 76.11 | 83.10  | 74.91 | 89.22  | 80.88 | 90.31  | 86.93
MLEn   | ML-Word2vec | 88.22  | 77.70 | 86.33  | 74.11 | 89.91  | 81.81 | 91.55  | 86.90
MLEn   | ML-BERT     | 90.42  | 79.79 | 92.01  | 74.28 | 90.47  | 82.91 | 93.18  | 87.77
BiCF   | ML-fastText | 86.21  | 76.16 | 83.31  | 75.01 | 90.24  | 82.58 | 90.84  | 87.23
BiCF   | ID-fastText | 87.08  | 76.34 | 84.21  | 75.79 | 90.83  | 82.92 | 91.52  | 87.67
BiCF   | ML-Word2vec | 88.80  | 77.91 | 87.12  | 74.24 | 90.01  | 82.87 | 91.58  | 87.03
BiCF   | ID-Word2vec | 88.92  | 78.84 | 88.52  | 74.35 | 90.31  | 83.15 | 91.82  | 87.49
BiCF   | ML-BERT     | 92.92  | 82.84 | 94.30  | 76.95 | 92.23  | 90.45 | 94.80  | 90.44
BiCF   | ID-BERT     | 93.02  | 82.91 | 94.73  | 77.15 | 92.73  | 91.03 | 94.88  | 90.74
Table 4. Experimental comparison on the ID-WOZ dataset (F1 scores; “†” denotes significance testing, $p$-value $<0.05$).

We also conduct a series of experiments feeding batches of annotated Indonesian data (1K, 2K, 4K, ..., full-scale sentences). We present the results for the restaurant, hotel, taxi, and attraction domains in Fig. 3 and Fig. 4, as they are widely usable domains with the largest scale of dialogue data and annotations in both MultiWOZ and ID-WOZ. The results on the entire annotated dataset and the code are reported in detail in Table 5 and Code 1. We also conduct a comparison between Multilingual-BERT (ML-BERT) and ID-BERT on all domains of the full-scale ID-WOZ, as reported in Table 6.

Method         | ID-WOZ data  | Restaurant     | Hotel          | Taxi           | Attraction
               |              | Intent | Slots | Intent | Slots | Intent | Slots | Intent | Slots
MT (ID-BERT)   | ID-WOZ-1000  | 87.33  | 56.67 | 90.83  | 60.14 | 86.66  | 40.28 | 90.01  | 62.29
               | ID-WOZ-2000  | 88.97  | 59.74 | 91.67  | 66.63 | 86.98  | 59.88 | 91.02  | 76.05
               | ID-WOZ-4000  | 90.01  | 70.67 | 93.23  | 69.35 | 89.50  | 74.09 | 93.05  | 83.73
               | ID-WOZ-8000  | 91.67  | 80.57 | 93.65  | 73.75 | 90.63  | 82.09 | 93.95  | 87.96
               | ID-WOZ-16000 | 92.37  | 81.88 | 93.78  | 75.79 | 91.76  | 83.59 | 94.07  | 89.63
               | ID-WOZ-All   | 92.25  | 81.87 | 93.42  | 75.65 | 91.67  | 82.17 | 94.25  | 90.40
MLEn (ML-BERT) | ID-WOZ-1000  | 84.11  | 55.59 | 89.73  | 60.41 | 82.93  | 22.66 | 89.32  | 64.44
               | ID-WOZ-2000  | 86.57  | 56.86 | 91.51  | 65.26 | 86.37  | 40.63 | 91.57  | 70.56
               | ID-WOZ-4000  | 89.57  | 68.99 | 91.90  | 72.20 | 87.93  | 46.42 | 92.58  | 84.22
               | ID-WOZ-8000  | 90.93  | 73.37 | 93.42  | 75.15 | 88.08  | 58.63 | 94.03  | 86.85
               | ID-WOZ-16000 | 90.92  | 74.24 | 93.28  | 75.89 | 88.12  | 64.12 | 94.11  | 87.67
               | ID-WOZ-All   | 90.89  | 75.36 | 93.23  | 75.97 | 88.34  | 64.86 | 94.25  | 88.71
BiCF (ML-BERT) | ID-WOZ-1000  | 84.23  | 59.92 | 87.66  | 59.81 | 84.78  | 72.31 | 87.87  | 69.41
               | ID-WOZ-2000  | 86.69  | 66.67 | 90.35  | 61.93 | 86.69  | 75.25 | 90.04  | 80.05
               | ID-WOZ-4000  | 89.07  | 76.10 | 91.77  | 68.85 | 88.82  | 81.87 | 92.72  | 85.52
               | ID-WOZ-8000  | 92.23  | 78.34 | 93.13  | 73.71 | 91.55  | 86.48 | 93.46  | 88.41
               | ID-WOZ-16000 | 92.92  | 82.84 | 94.30  | 76.95 | 92.23  | 90.45 | 94.80  | 90.44
               | ID-WOZ-All   | 92.60  | 82.67 | 94.24  | 76.91 | 92.25  | 89.43 | 94.77  | 90.45
ID-BERT        | ID-WOZ-All   | 92.22  | 82.14 | 93.91  | 76.88 | 91.97  | 88.13 | 93.96  | 90.20
Table 5. Performance comparison of different methods on the selected MultiWOZ and ID-WOZ domains with different amounts of fed ID-WOZ data.
Domain     | ML-BERT        | ID-BERT
           | Intent | Slots | Intent | Slots
Restaurant | 91.07  | 77.68 | 92.22  | 82.14
Hotel      | 92.78  | 74.91 | 93.91  | 76.88
Taxi       | 90.84  | 82.91 | 91.97  | 88.13
Attraction | 93.25  | 88.04 | 93.96  | 90.20
Plane      | 91.36  | 92.77 | 93.42  | 93.11
Police     | 90.02  | 88.89 | 92.78  | 90.07
Movie      | 90.57  | 86.14 | 91.76  | 87.98
Hospital   | 92.64  | 84.15 | 93.85  | 86.09
Wear       | 90.77  | 87.02 | 91.80  | 88.34
Table 6. Experimental comparison of ML-BERT and ID-BERT on the full-scale ID-WOZ.

6.4. Results Analysis

The results of the methods from Section 6.2 are shown in Table 5, using the English data of MultiWOZ and 16K Indonesian sentences of ID-WOZ. The machine-translation-based methods (MT + ML-BERT / ID-BERT) surpass the multilingual model with English data (MLEn + ML-BERT) on intent classification, outperforming it by about 1.21%/1.95%, 0.51%/1.77%, 0.73%/1.29%, and 0.59%/0.89% F1 on restaurant, hotel, taxi, and attraction, respectively. The main reason is that the machine translation methods enjoy many more Indonesian sentences with corresponding intent labels. On the slot-filling task, however, the machine translation methods are weaker, because they suffer from invalid or mismatched labels after translation. Overall, our proposed framework (BiCF + ML-BERT / ID-BERT) performs better than the others on both tasks, as it effectively utilizes both the English intent labels and the correct slot-filling annotations. From Table 6, we can see that ID-BERT outperforms ML-BERT across all domains, demonstrating that the Indonesian-specific word embedding model (ID-BERT) captures more information and semantic knowledge than the general multilingual model (ML-BERT) in all domains.

Refer to caption
Figure 3. The comparison of different methods on four domains.
Refer to caption
Figure 4. The comparison of different methods on four domains.

6.5. Effectiveness of Using ID-WOZ

The line charts are shown in Fig. 3, where the top four sub-graphs show intent classification, the bottom four show slot-filling, and the red line marks the performance of the ID-BERT baseline. The detailed results and the line charts for the remaining domains are in Fig. 4.

1). MT methods rely heavily on translation quality. We run the BLEU (Papineni et al., 2002) test on the entire MultiWOZ; the translation scores 28.46 (BLEU-5) on 30K sentences. During dialogue translation, however, a single incorrect word can cause misunderstanding. Several examples are shown in Fig. 7: different sentences in English may be translated from the same source sentence in Indonesian. In the first case, the true meaning is requesting “how much”, but the model may misread the customer's intent as requesting the type of plane ticket. In the second, the customer is asking “how” to order a ticket, but the translator yields a “request location” intent. Based on Fig. 3 and Fig. 4, when the scale of ID-WOZ is negligible, machine translation has a large advantage on intent classification but performs badly on slot-filling. The reason is that the MT method adjusts or resets the grammar and syntactic structure toward the target language, which leaves the English slot labels dislocated, invalid, or wrong.

2). MLEn methods initially learn semantic information only from English data, which yields lower intent classification accuracy than the other methods. When this model is fed with ID-WOZ, it retains a weakness inherited from the English data: the large-scale English corpus drowns out the fed ID-WOZ data. This method is strong on slot-filling when only a small amount of ID-WOZ is used, because the slot-filling labels in the English data are accurate and complete. But performance does not improve as more ID-WOZ data is used, showing that ML-BERT is limited in reaching higher performance. Overall, this method is not recommended for building stable low-resource language dialogue understanding models, even with gold annotated data.

Refer to caption
Figure 5. An example of editing template interface.
Refer to caption
Figure 6. An example of the annotation procedure. For domain/action/intent classification, the annotator clicks among multiple labels, predefined for each domain. For slot-filling, our platform provides a convenient approach: click and underline the content, then select its slot type from a menu that pops up automatically when any words are selected.
Refer to caption
Figure 7. Illustration of mistakes from machine translation. The green sentence shows the true meaning and the red one the machine translation result. These two examples show that a tiny mistake during translation may cause complete misunderstanding.
Refer to caption
Figure 8. Illustration of annotations becoming invalid during machine translation.

3). BiCF does not outperform machine translation on intent classification when the amount of fed Indonesian data is negligible. As the ID-WOZ data grows, the strength of BiCF becomes more obvious: it begins to significantly outperform the other methods. It avoids the misunderstandings caused by translation and mitigates the shrinking effect of the English corpus, achieving the best performance, even better than the ID-BERT baseline, when the ID-WOZ data reaches around 16K, with intent classification F1 scores of 92.92%, 94.30%, 92.23%, and 94.80% for the restaurant, hotel, taxi, and attraction domains, respectively. This method also outperforms the others on slot-filling when the fed ID-WOZ data is negligible: not only does it make use of correct slot-filling annotations from the English dataset, but it also reduces the adverse effects of the large-scale English corpus. The slot-filling F1 scores reach 82.84%, 76.95%, 90.45%, and 90.44% for restaurant, hotel, taxi, and attraction, respectively. Fig. 4 reports the results of the three methods trained with 16K ID-WOZ sentences; the cross-lingual method performs better than the others when slots need more words to describe.

7. Conclusion and Future Work

We empirically investigated how to build low-resource language dialogue understanding models from scratch with an English dataset. Directly translating from English to Indonesian, or simply utilizing a multilingual pre-trained model, does not perform well. Instead, our framework BiCF leverages the rich and accurately annotated English dataset, performs effectively, and obtains reliable results. We further release a large Indonesian dialogue dataset and an ID-BERT model for future research.

8. Acknowledgments

This research is supported by the Natural Science Foundation of Heilongjiang Province (YQ2021F006).

References

  • Ammar et al. (2016) Waleed Ammar, George Mulcaire, Miguel Ballesteros, Chris Dyer, and Noah A Smith. 2016. Many languages, one parser. Transactions of the Association for Computational Linguistics 4 (2016), 431–444.
  • Artetxe et al. (2017) Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 451–462.
  • Asri et al. (2017) Layla El Asri, Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine, Rahul Mehrotra, and Kaheer Suleman. 2017. Frames: A corpus for adding memory to goal-oriented dialogue systems. arXiv preprint arXiv:1704.00057 (2017).
  • Budzianowski et al. (2018) Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. Multiwoz-a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278 (2018).
  • Chen et al. (2016) Hongshen Chen, Yue Zhang, and Qun Liu. 2016. Neural network for heterogeneous annotations. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 731–741.
  • Chen et al. (2017) Tao Chen, Ruifeng Xu, Yulan He, and Xuan Wang. 2017. Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN. Expert Systems with Applications 72 (2017), 221–230.
  • Cheng (2019) Yong Cheng. 2019. Semi-supervised learning for neural machine translation. In Joint Training for Neural Machine Translation. Springer, 25–40.
  • Chowanda and Chowanda (2017) Andry Chowanda and Alan Darmasaputra Chowanda. 2017. Recurrent neural network to deep learn conversation in indonesian. Procedia computer science 116 (2017), 579–586.
  • Collins (2011) Michael Collins. 2011. Statistical machine translation: IBM models 1 and 2. Columbia University (2011).
  • de Melo (2017) Gerard de Melo. 2017. Multilingual vector representations of words, sentences, and documents. In Proceedings of the IJCNLP 2017, Tutorial Abstracts. 3–5.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Dozat and Manning (2016) Timothy Dozat and Christopher D Manning. 2016. Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734 (2016).
  • Duong et al. (2015) Long Duong, Trevor Cohn, Steven Bird, and Paul Cook. 2015. A neural network model for low-resource universal dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 339–348.
  • Dyer et al. (2013) Chris Dyer, Victor Chahuneau, and Noah A Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of NAACL-HLT 2013.
  • Eric and Manning (2017) Mihail Eric and Christopher D Manning. 2017. Key-value retrieval networks for task-oriented dialogue. arXiv preprint arXiv:1705.05414 (2017).
  • Fleiss et al. (1969) Joseph L Fleiss, Jacob Cohen, and Brian S Everitt. 1969. Large sample standard errors of kappa and weighted kappa. Psychological bulletin 72, 5 (1969), 323.
  • Grave et al. (2018) Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning Word Vectors for 157 Languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
  • Guo et al. (2015) Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2015. Cross-lingual dependency parsing based on distributed representations. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: Long papers). 1234–1244.
  • Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651 (2016).
  • Kelley (1984) John F Kelley. 1984. An iterative design methodology for user-friendly natural language office information applications. ACM Transactions on Information Systems (TOIS) 2, 1 (1984), 26–41.
  • Koto (2016) Fajri Koto. 2016. A publicly available indonesian corpora for automatic abstractive and extractive chat summarization. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). 801–805.
  • Liu et al. (2018) Hui Liu, Qingyu Yin, and William Yang Wang. 2018. Towards Explainable NLP: A Generative Explanation Framework for Text Classification. arXiv preprint arXiv:1811.00196 (2018).
  • Lowe et al. (2015) Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909 (2015).
  • McCann et al. (2017) Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems. 6294–6305.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 311–318.
  • Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018).
  • Pires et al. (2019) Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is Multilingual BERT? arXiv preprint arXiv:1906.01502 (2019).
  • Ramos et al. (2003) Juan Ramos et al. 2003. Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning, Vol. 242. Piscataway, NJ, 133–142.
  • Salton et al. (1982) Gerard Salton, Edward A Fox, and Harry Wu. 1982. Extended Boolean information retrieval. Technical Report. Cornell University.
  • Schuster et al. (2018) Sebastian Schuster, Sonal Gupta, Rushin Shah, and Mike Lewis. 2018. Cross-lingual Transfer Learning for Multilingual Task Oriented Dialog. arXiv preprint arXiv:1810.13327 (2018).
  • Schuster et al. (2019) Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019. Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing. arXiv preprint arXiv:1902.09492 (2019).
  • Shah et al. (2018) Pararth Shah, Dilek Hakkani-Tür, Gokhan Tür, Abhinav Rastogi, Ankur Bapna, Neha Nayak, and Larry Heck. 2018. Building a conversational agent overnight with dialogue self-play. arXiv preprint arXiv:1801.04871 (2018).
  • Tho et al. (2018) Cuk Tho, Arden S Setiawan, and Andry Chowanda. 2018. Forming of Dyadic Conversation Dataset for Bahasa Indonesia. Procedia Computer Science 135 (2018), 315–322.
  • Tiedemann (2015) Jörg Tiedemann. 2015. Improving the cross-lingual projection of syntactic dependencies. In Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015, May 11-13, 2015, Vilnius, Lithuania. Linköping University Electronic Press, 191–199.
  • Tiedemann and Agić (2016) Jörg Tiedemann and Zeljko Agić. 2016. Synthetic treebanking for cross-lingual dependency parsing. Journal of Artificial Intelligence Research 55 (2016), 209–248.
  • Vaswani et al. (2018) Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, et al. 2018. Tensor2tensor for neural machine translation. arXiv preprint arXiv:1803.07416 (2018).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
  • Wang et al. (2017) Hongmin Wang, Yue Zhang, GuangYong Leonard Chan, Jie Yang, and Hai Leong Chieu. 2017. Universal dependencies parsing for colloquial singaporean english. arXiv preprint arXiv:1705.06463 (2017).
  • Wen et al. (2016) Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2016. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562 (2016).
  • Williams et al. (2013) Jason Williams, Antoine Raux, Deepak Ramachandran, and Alan Black. 2013. The dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference. 404–413.
  • Wu et al. (2019) Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019. Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems. arXiv preprint arXiv:1905.08743 (2019).
  • Wu et al. (2016) Yu Wu, Wei Wu, Chen Xing, Ming Zhou, and Zhoujun Li. 2016. Sequential matching network: A new architecture for multi-turn response selection in retrieval-based chatbots. arXiv preprint arXiv:1612.01627 (2016).
  • Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237 (2019).
  • Zhang et al. (2019) Meishan Zhang, Yue Zhang, and Guohong Fu. 2019. Cross-Lingual Dependency Parsing Using Code-Mixed TreeBank. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 996–1005.