An Effective Method using Phrase Mechanism in
Neural Machine Translation

Phuong Nguyen Minh Le Nguyen
Japan Advanced Institute of Science and Technology, Japan
{phuongnm, nguyenml}@jaist.ac.jp

Tóm tắt nội dung

Machine Translation is one of the essential tasks in Natural Language Processing (NLP), which has massive applications in real life as well as contributing to other tasks in the NLP research community. Recently, Transformer -based methods have attracted numerous researchers in this domain and achieved state-of-the-art results in most of the pair languages. In this paper, we report an effective method using a phrase mechanism, PhraseTransformer, to improve the strong baseline model Transformer, in constructing a Neural Machine Translation (NMT) system for parallel corpora Vietnamese-Chinese. Our experiments on the MT dataset of the VLSP 2022 competition achieved the BLEU score of 35.3 on Vietnamese to Chinese and 33.2 BLEU scores on Chinese to Vietnamese data. Our code is available at https://github.com/phuongnm94/PhraseTransformer.

1 Introduction

In the NLP area, Machine Translation is the primary task that has a long time history of development, especially in the approaches using Neural Networks (Sutskever et al., 2014; Bahdanau et al., 2015). Besides, there are numerous proposed architectures in this domain that have got a significant effect on the NLP community in many domains (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017; Devlin et al., 2019). Therefore, in this work, we focus on Machine Translation task in The Ninth International Workshop on Vietnamese Language and Speech Processing (VLSP 2022)¹¹1https://vlsp.org.vn/vlsp2022.

VLSP 2022 is a competition of one of the biggest NLP communities in Vietnam, Association for Vietnamese Language and Speech Processing, hosted by the VNU University of Science, Hanoi. This is a good chance for researchers in many domains to introduce their awesome research results and catch the state-of-the-art machine learning models in different domains.

This year, the machine translation task is designed to translate from Vietnamese to Chinese and on the contrary direction. In this task, given the input is a natural sentence in Chinese (or Vietnamese), the machine translation system is required to generate a new sentence in Vietnamese (or Chinese) that has the same meaning as the input sentence (Table 1).

Language	Content
Vietnamese	Tôi vừa có kế hoạch thay họ biểu diễn ở bữa tiệc.
Chinese	正打算让这群废物在派对上表演呢
(meaning)	I just had the plan to perform on their behalf at the party.
Vietnamese	Không! Cậu đã lấy đi tất cả những gì quý giá đối với tôi rời, cậu Toretto à!
Chinese	我有价值的东西早被你榨取光了托雷托先生
(meaning)	No! You took everything that was precious to me, Mr. Toretto!

Table 1: Example of machine translation task in the parallel Vietnamese-Chinese corpus of VLSP 2022. The (meaning) row is to support the readers of this paper which are not provided in dataset.

Recently, the self-attention mechanism has occupied a lot of attention, especially the Transformer architecture (Vaswani et al., 2017; Devlin et al., 2019). This architecture leads many state-of-the-art results in various NLP domains (Devlin et al., 2019; Lewis et al., 2020; Raffel et al., 2020). In this work, we applied a new approach based on the Transformer architecture, PhraseTransformer (Nguyen et al., 2023), which leverages a phrase attention mechanism to solve the machine translation task. The main idea of this approach is to enhance the word representation by its local contexts (or phrases), and apply the self-attention mechanism to model the dependencies between phrases in a sentences. The experimental results on the machine translation dataset of the VLSP 2022 workshop show that our PhraseTransformer clearly beat the original Transformer model in both two directions from Vietnamese to Chinese and from Chinese to Vietnamese. Furthermore, the PhraseTransformer does not require any external syntax tree information as the previous works (Yang et al., 2018; Wang et al., 2019; Nguyen et al., 2020; Bugliarello and Okazaki, 2020) and is more lightweight compared with other phrase-level attention models (Xu et al., 2020). To this end, our PhraseTransformer improved the original Transformer by 1.1 BLEU scores on the setting from Vietnamese to Chinese and 1.3 BLEU scores on the setting from Chinese to Vietnamese.

The remainder of this paper, we detail our work in four sections. Section 2, we show the related works in this domain. And describe the details of our PhraseTransformer architecture in section 3. Then, the experiments and result analysis are shown in section 4. Finally, in section 5, we conclude this paper.

2 Related works

In the success of Transformer Vaswani et al. (2017) in the machine translation task, there are a numerous works proposed new directions to improve this architecture. Among them, the approach considering of utilize the phrase linguistic structure representation is promising. The works presented in Yang et al. (2020); Shang et al. (2021); Nguyen et al. (2022) demonstrate that the template prediction can guide an NMT system in the decoding process to improve performance. The works (Wu et al., 2018; Nguyen et al., 2020; Wang et al., 2019) indicate that the phrase information extracted from the syntax tree can improve the meaning representation of sentence. However, the performance of these systems is affected by the quality of the syntax tree extraction step which is usually lower performance in low-resource languages. Close to our PhraseTransformer, Xu et al. (2020) also presents a different method to model the phrase in a source sentence, however, this architecture uses a huge number of parameters to learn the attention scores between source phrases with target words. Compared with this work, by using LSTM architecture (Hochreiter and Schmidhuber, 1997) in the Multi-head layers to model various local context information, our PhraseTransformer increases the model size with a small margin but works effectively.

3 Model architecture

In this section, we describe our PhraseTransformer model (Nguyen et al., 2023), which is used to submit the machine translation task in VLSP 2022 workshop (Figure 1). We define the input vector ( $\mathbf{x}$ ) is synthesized from the word embedding and positional encoding $\mathbf{x}=[\mathbf{x}_{1},...,\mathbf{x}_{|S|}]$ ( $|S|$ is sentence length) similar to Vaswani et al. (2017). This vector is processed by a Dense layer to get multi-views of input data corresponding to different heads.

\displaystyle\mathbf{q}_{i},\mathbf{k}_{i},\mathbf{v}_{i}

\displaystyle=\mathbf{x}\mathbf{W}_{i}^{q},\mathbf{x}\mathbf{W}_{i}^{k},\mathbf{x}\mathbf{W}_{i}^{v}

(1)

where $i$ is the head index, $1\leq i\leq{H}$ where $H$ is the number of heads in Multi-Head layer. We define a $\operatorname{Phrase}$ function to extract local context:

(2)

where $m_{t}$ is the hyper-parameter gram size, $\mathbf{s}_{i}$ is the sequence of words vector, $n\mathrm{\_gramLSTM}_{k}$ is the function which capture local context using LSTM model:

	$\displaystyle m_{t}\mathrm{\_gram}_{k}(\mathbf{s}_{i})$	$\displaystyle=[\mathbf{H}_{k-n_{i}+1},\mathbf{H}_{k-n_{i}+2},...,\mathbf{H}_{k}]$
	$\displaystyle n\mathrm{\_gramLSTM}_{k}(\mathbf{s}_{i})$	$\displaystyle=\mathrm{LSTM}^{f}_{i}(m_{t}\mathrm{\_gram}_{k}(\mathbf{s}_{i}))+$
		$\displaystyle\quad\quad\mathrm{LSTM}^{b}_{i}(m_{t}\mathrm{\_gram}_{k}(\mathbf{s}_{i}))$

Then, the phrase information is integrated to the word representations:

$\displaystyle\operatorname{zip}(A,B)$	$\displaystyle=[A^{\intercal};B^{\intercal}]^{\intercal}$	(3)
$\displaystyle\mathbf{ph}^{k}_{i,m_{t}},\mathbf{ph}^{q}_{i,m_{t}}$	$\displaystyle=\operatorname{unzip}(\operatorname{Phrase}(\operatorname{zip}(\mathbf{q}_{i},\mathbf{k}_{i}),m_{t}))$	(4)
$\displaystyle\mathbf{ph}^{q}_{i}$	$\displaystyle=\operatorname{zip}(\{\mathbf{ph}^{q}_{i,m_{t}}\mid m_{t}\in\mathbf{m}\})$	(5)
$\displaystyle\mathbf{ph}^{k}_{i}$	$\displaystyle=\operatorname{zip}(\{\mathbf{ph}^{k}_{i,m_{t}}\mid m_{t}\in\mathbf{m}\})$	(6)

After that, these vectors are fed to the Self-Attention layer to learn the dependencies between words and phrases.

\displaystyle\mathbf{head}_{i}

\displaystyle=\mathrm{softmax}(\frac{\mathbf{ph}^{q}_{i}\,\,{\mathbf{ph}^{k}_{i}}^{\intercal}}{\sqrt{d_{h}}})\,\mathbf{v}_{i}

(7)

where $d_{h}$ is dimension of one head in Multi-Head layer. Finally, all head vector representation is combined to get the final sentence vector ( $\mathbf{h}_{out}$ ) following Vaswani et al. (2017):

$\displaystyle\mathbf{h}_{mh}$	$\displaystyle=[\mathbf{head}_{1};...;\mathbf{head}_{H}]\mathbf{W}^{o}$	(8)
$\displaystyle\mathbf{h}_{no}$	$\displaystyle=\mathrm{LayerNorm}(\mathbf{h}_{mh}+\mathbf{x})$	(9)
$\displaystyle\mathbf{h}_{out}$	$\displaystyle=\mathrm{LayerNorm}(\mathrm{FeedForward}(\mathbf{h}_{no})+\mathbf{h}_{no})$	(10)

This sentence vector is processed via a stack of $N$ Identical layers with the same architecture and forwarded to the Transformer Decoder layer (Vaswani et al., 2017) for decoding process.

Refer to caption — Figure 1: Overview of PhraseTransformer (CrossH) using $n$ _gram LSTM in MultiHead layer. In this case, the phrase representations are built with $\mathrm{gram\_size}=\{2,3\}$ , 2-gram, 3-gram models apply to all 8 heads.

Training method

The training objective is to maximize the Log-Likelihood function of the probabilities to generate the target sentence ( $y$ ) given a source sentence ( $x$ ) from an machine translation parallel dataset ( $\mathcal{D}$ ):

\displaystyle\mathrm{maximize}:\sum_{<x,y>\in\mathcal{D}}\log\,p_{\theta}\,(y\mid x)

(11)

4 Experiments and Results

In this section, we describe in detail about our experiments and the corresponding results.

Dataset.

We use the parallel corpora Chinese-Vietnamese provided by the VLSP 2022 workshop for the training and development process. For more understanding of this dataset, we show the data analysis information on Table 2.

#examples			#vocab		Avg. length
Train	Dev	Test	Vi	Zh	Vi	Zh
300,347	999	999	4K	16K	18.99	22.85

Table 2: Statistics information of parallel corpora Chinese-Vietnamese in VLSP 2022 workshop. Vocabulary size and average length of source (Src) and target (Tgt) side are computed on training set.

Prepossessing.

We follow the previous work (Provilkov et al., 2020) for reprocessing step. To deal with the out-of-vocab problem, we use BPE encode method (Sennrich et al., 2016) to split words into sub-tokens with the number of operators is 4000 and 16000 operators in Vietnamese (Vi) and Chinese (Zh) languages, respectively. In the Chinese language, because there is no space between characters, the BPE segment module directly processes the whole raw sentence as a word segment.

Evaluation metric.

We use the BLEU score to evaluate the quality of the translated sentence. For a fair comparison with other works, we used SacreBleu (Post, 2018) tool²²2https://github.com/mjpost/sacrebleu. To evaluate the system translated to the Chinese language, we used a word tokenizer default provided by this tool with setting --tok zh. To this end, the output of our machine translation system is compared with the raw sentence of the target language provided by VLSP 2022 parallel corpora to get the BLEU score. The signature of our evaluation setting on both translation directions are:

•

Vietnamese-Chinese: BLEU+case.mixed +lang.vi-zh +numrefs.1 +smooth.exp +tok.zh +version.1.5.1
•

Chinese-Vietnamese: BLEU+case.mixed +lang.zh-vi +numrefs.1 +smooth.exp +tok.13a +version.1.5.1

Experimental setting.

For comparison between our PhraseTransformer and the original Transformer Vaswani et al. (2017), we implemented to train and evaluate these models on the parallel corpora Chinese-Vietnamese VLSP 2022 without any external data or pre-trained model. We conducted all experiments with the same setting on server NVIDIA A40. The hidden size is $512$ ; the learning rate is initialized by $7e^{-4}$ ; the learning warm-up step is $6000$ ; both the transformer encoder and decoder contain 6 layers, 4 heads. In the training process, the batch size is 4096 tokens, and the maximum number of epochs is $100$ . Following Provilkov et al. (2020), we average 5 latest checkpoints³³3The script is used to get average 5 latest checkpoints: https://github.com/facebookresearch/fairseq/blob/main/scripts/average_checkpoints.py to get a final model for the translation testing process.

Results.

We report the the experimental results of machine translation systems from Chinese to Vietnamese and from Vietnamese to Chinese on Table 3, and Table 4, respectively⁴⁴4The underline setting in these tables is the model we submitted at the time of the VLSP 2022 competition.. Based on these results, we found that our PhraseTransformer can beat the original Transformer with almost settings of gram sizes. These results proved the effectiveness of our phrase modeling mechanism, which supports the translation system better in sentence meaning representation. Besides, our model is work well without any external syntax tree information that makes the PhraseTransformer architecture can widely adaptable to many other languages as well as other NLP tasks.

Model	gram sizes ( $\mathbf{m}$ )	Dev	Test
Transformer	-	30.4	31.9
PhraseTransformer	{3}	31.0	33.0
PhraseTransformer	{4}	30.9	32.7
PhraseTransformer	{2,3}	30.8	33.2

Table 3: Experimental result using SacreBLEU of translation system from Chinese to Vietnamese. The test set is the public test provided at the time of the competition.

Model	gram sizes ( $\mathbf{m}$ )	Dev	Test
Transformer	-	29.5	34.2
PhraseTransformer	{3}	30.1	35.0
PhraseTransformer	{4}	30.1	35.3
PhraseTransformer	{2,3}	30.0	34.1

Table 4: The experimental result using SacreBLEU of translation system from Vietnamese to Chinese.

Improvement example.

For more understanding about the improvement of PhraseTransformer compared with the original Transformer, we observe the difference between the outputs of two models and show it in Table 5. In the first row, the original Transformer has missed a part of the information in the translation process, “các dịch xã hội cơ bản như thông tin” (basic social services such as information), while our PhraseTransformer can recognize it. Besides, in the second row, the output of the original Transformer is lack the smoothness of the natural language “đặt mức độ suy giảm khủng hoảng trung bình” that makes the translated sentence is misunderstood “4 loại dịch bệnh”. We argue that because our PhraseTransformer captured the dependencies between phrases, that make model is better in capture the sub-conditions of the long sentence. In addition, our PhraseTransformer captured the local context in the phrases which supports improving the meaning representation in the translation process.

Input	这旨在面向全面保障贫困人口的社会民生权利，提高贫困人口获得信息、就业、医疗、教育、住房、生活用水、卫生、环境等基本社会服务的机会。
Gold sentence	Điều này hướng tới đảm bảo toàn diện quyền an sinh xã hội của người nghèo, cải thiện mức độ tiếp cận của người nghèo đối với các dịch vụ xã hội cơ bản như thông tin, việc làm, y tế, giáo dục, nhà ở, nước sinh hoạt và vệ sinh, môi trường.
Transformer	Đây là cơ hội để đảm bảo toàn diện các quyền an sinh xã hội của người nghèo, nâng cao việc làm, y tế, nhà ở, nước sinh hoạt, vệ sinh, môi trường.
PhraseTransformer	Đây là dịp để đảm bảo toàn diện các quyền an sinh xã hội cho người nghèo, nâng cao các dịch vụ xã hội cơ bản như thông tin, việc làm, y tế, giáo dục, nhà ở, nước sinh hoạt, vệ sinh, môi trường.
Input	政府将设低危机、中等危机、高危机、最高危机4个疫情级别。
Gold sentence	Chính phủ đã phân loại 4 cấp độ dịch: nguy cơ thấp (bình thường mới), nguy cơ trung bình, nguy cơ cao, nguy cơ rất cao.
Transformer	Chính phủ sẽ đặt mức độ suy giảm khủng hoảng trung bình, khủng hoảng cao, nguy cơ cao nhất, 4 loại dịch bệnh.
PhraseTransformer	Chính phủ sẽ có 4 cấp độ dịch bệnh thấp, nguy cơ trung bình, nguy cơ cao và khủng hoảng cao nhất.

Table 5: Improvement examples of the PhraseTransformer compared with Transformer in the Chinese to Vietnamese test set. The red part is the main difference between these models.

5 Conclusion

In this paper, we first applied our PhraseTransformer model to the machine translation task for the pair of languages Vietnamese - Chinese. In this architecture, the original Transformer Encoder is enhanced by incorporating the phrase dependencies information into the Self-Attention mechanism. Our experimental results showed that our PhaseTransformer beat the original Transformer by a large margin in both translation directions. In future work, we would like to apply our model architecture to other NLP tasks and explore other effective phrase modeling methods to achieve better results.

Tài liệu

Bahdanau et al. (2015) Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. 3rd International Conference on Learning Representations, ICLR 2015 ; Conference date: 07-05-2015 Through 09-05-2015.
Bugliarello and Okazaki (2020) Emanuele Bugliarello and Naoaki Okazaki. 2020. Enhancing machine translation with dependency-aware self-attention. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1618–1627, Online. Association for Computational Linguistics.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780.
Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
Nguyen et al. (2022) Phuong Nguyen, Tung Le, Thanh-Le Ha, Thai Dang, Khanh Tran, Kim Anh Nguyen, and Nguyen Le Minh. 2022. Improving neural machine translation by efficiently incorporating syntactic templates. In Advances and Trends in Artificial Intelligence. Theory and Practices in Artificial Intelligence, pages 303–314, Cham. Springer International Publishing.
Nguyen et al. (2023) Phuong Minh Nguyen, Tung Le, Huy Tien Nguyen, Vu Tran, and Minh Le Nguyen. 2023. Phrasetransformer: an incorporation of local context information into sequence-to-sequence semantic parsing. Applied Intelligence, 53(12):15889–15908.
Nguyen et al. (2020) Xuan-Phi Nguyen, Shafiq Joty, Steven Hoi, and Richard Socher. 2020. Tree-structured attention with hierarchical accumulation. In International Conference on Learning Representations.
Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels. Association for Computational Linguistics.
Provilkov et al. (2020) Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita. 2020. BPE-dropout: Simple and effective subword regularization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1882–1892, Online. Association for Computational Linguistics.
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
Shang et al. (2021) Wei Shang, Chong Feng, Tianfu Zhang, and Da Xu. 2021. Guiding neural machine translation with retrieved translation template. 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–7.
Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc.
Wang et al. (2019) Yaushian Wang, Hung-Yi Lee, and Yun-Nung Chen. 2019. Tree transformer: Integrating tree structures into self-attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1061–1070, Hong Kong, China. Association for Computational Linguistics.
Wu et al. (2018) Wei Wu, Houfeng Wang, Tianyu Liu, and Shuming Ma. 2018. Phrase-level self-attention networks for universal sentence encoding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3729–3738, Brussels, Belgium. Association for Computational Linguistics.
Xu et al. (2020) Hongfei Xu, Josef van Genabith, Deyi Xiong, Qiuhui Liu, and Jingyi Zhang. 2020. Learning source phrase representations for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 386–396, Online. Association for Computational Linguistics.
Yang et al. (2018) Baosong Yang, Zhaopeng Tu, Derek F. Wong, Fandong Meng, Lidia S. Chao, and Tong Zhang. 2018. Modeling localness for self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4449–4458, Brussels, Belgium. Association for Computational Linguistics.
Yang et al. (2020) Jian Yang, Shuming Ma, Dongdong Zhang, Zhoujun Li, and Ming Zhou. 2020. Improving neural machine translation with soft template prediction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5979–5989, Online. Association for Computational Linguistics.

An Effective Method using Phrase Mechanism in Neural Machine Translation