An Effective Method using Phrase Mechanism in
Neural Machine Translation
Abstract
Machine Translation is one of the essential tasks in Natural Language Processing (NLP); it has massive real-life applications and contributes to other tasks in the NLP research community. Recently, Transformer-based methods have attracted numerous researchers in this domain and achieved state-of-the-art results on most language pairs. In this paper, we report an effective method using a phrase mechanism, PhraseTransformer, to improve the strong baseline Transformer model when building a Neural Machine Translation (NMT) system for the Vietnamese-Chinese parallel corpora. Our experiments on the MT dataset of the VLSP 2022 competition achieved BLEU scores of 35.3 for Vietnamese to Chinese and 33.2 for Chinese to Vietnamese. Our code is available at https://github.com/phuongnm94/PhraseTransformer.
1 Introduction
In the NLP area, Machine Translation is a primary task with a long history of development, especially in approaches using Neural Networks (Sutskever et al., 2014; Bahdanau et al., 2015). Besides, numerous architectures proposed in this domain have had a significant impact on the NLP community across many areas (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017; Devlin et al., 2019). Therefore, in this work, we focus on the Machine Translation task of The Ninth International Workshop on Vietnamese Language and Speech Processing (VLSP 2022, https://vlsp.org.vn/vlsp2022).
VLSP 2022 is a competition organized by one of the biggest NLP communities in Vietnam, the Association for Vietnamese Language and Speech Processing, and hosted by the VNU University of Science, Hanoi. It is a good opportunity for researchers in many domains to present their research results and keep up with state-of-the-art machine learning models in different areas.
This year, the machine translation task is designed to translate from Vietnamese to Chinese and in the reverse direction. In this task, given a natural sentence in Chinese (or Vietnamese) as input, the machine translation system is required to generate a new sentence in Vietnamese (or Chinese) that has the same meaning as the input sentence (Table 1).
| Language | Content |
|---|---|
| Vietnamese | Tôi vừa có kế hoạch thay họ biểu diễn ở bữa tiệc. |
| Chinese | 正打算让这群废物在派对上表演呢 |
| (meaning) | I just had the plan to perform on their behalf at the party. |
| Vietnamese | Không! Cậu đã lấy đi tất cả những gì quý giá đối với tôi rời, cậu Toretto à! |
| Chinese | 我有价值的东西早被你榨取光了 托雷托先生 |
| (meaning) | No! You took everything that was precious to me, Mr. Toretto! |
Recently, the self-attention mechanism has attracted a lot of attention, especially the Transformer architecture (Vaswani et al., 2017; Devlin et al., 2019). This architecture has led to many state-of-the-art results in various NLP domains (Devlin et al., 2019; Lewis et al., 2020; Raffel et al., 2020). In this work, we applied a new approach based on the Transformer architecture, PhraseTransformer (Nguyen et al., 2023), which leverages a phrase attention mechanism to solve the machine translation task. The main idea of this approach is to enhance each word representation with its local contexts (or phrases) and to apply the self-attention mechanism to model the dependencies between phrases in a sentence. The experimental results on the machine translation dataset of the VLSP 2022 workshop show that our PhraseTransformer clearly beats the original Transformer model in both directions, from Vietnamese to Chinese and from Chinese to Vietnamese. Furthermore, PhraseTransformer does not require any external syntax tree information, unlike previous works (Yang et al., 2018; Wang et al., 2019; Nguyen et al., 2020; Bugliarello and Okazaki, 2020), and is more lightweight than other phrase-level attention models (Xu et al., 2020). To this end, our PhraseTransformer improved over the original Transformer by 1.1 BLEU points in the Vietnamese-to-Chinese setting and 1.3 BLEU points in the Chinese-to-Vietnamese setting.
In the remainder of this paper, we detail our work in four sections. Section 2 reviews related work in this domain. Section 3 describes the details of our PhraseTransformer architecture. Section 4 presents the experiments and result analysis. Finally, Section 5 concludes the paper.
2 Related works
Following the success of the Transformer (Vaswani et al., 2017) in the machine translation task, numerous works have proposed new directions to improve this architecture. Among them, approaches that utilize phrase-level linguistic structure are promising. The works presented in Yang et al. (2020); Shang et al. (2021); Nguyen et al. (2022) demonstrate that template prediction can guide an NMT system in the decoding process to improve performance. Other works (Wu et al., 2018; Nguyen et al., 2020; Wang et al., 2019) indicate that phrase information extracted from a syntax tree can improve the meaning representation of a sentence. However, the performance of these systems is affected by the quality of the syntax tree extraction step, which is usually lower for low-resource languages. Close to our PhraseTransformer, Xu et al. (2020) also presents a different method to model phrases in a source sentence; however, this architecture uses a huge number of parameters to learn the attention scores between source phrases and target words. Compared with this work, by using an LSTM architecture (Hochreiter and Schmidhuber, 1997) inside the Multi-Head layers to model various kinds of local context information, our PhraseTransformer increases the model size by only a small margin but works effectively.
3 Model architecture
In this section, we describe our PhraseTransformer model (Nguyen et al., 2023), which we used for our submission to the machine translation task of the VLSP 2022 workshop (Figure 1). We define the input matrix $X \in \mathbb{R}^{n \times d}$ ($n$ is the sentence length, $d$ is the model dimension), synthesized from the word embeddings and positional encodings similar to Vaswani et al. (2017). This matrix is processed by a Dense layer to get multiple views of the input data corresponding to the different heads:

$$X_i = X W_i \qquad (1)$$

where $i$ is the head index, $1 \le i \le h$, and $h$ is the number of heads in the Multi-Head layer. We define a function to extract local context:

$$P_i^{(g)} = f_{\mathrm{LSTM}}\big(\mathrm{ngram}(X_i, g)\big) \qquad (2)$$

where $g$ is the hyper-parameter gram size, $X_i = (x^i_1, \ldots, x^i_n)$ is the sequence of word vectors, $\mathrm{ngram}(X_i, g)$ denotes the sequence of $g$-word windows of $X_i$, and $f_{\mathrm{LSTM}}$ is the function which captures local context using an LSTM model.
Then, the phrase information is integrated into the word representations to form the query, key, and value of each head:

$$P_i = \sum_{g \in G} P_i^{(g)} \qquad (3)$$

$$Q_i = X_i W^Q_i \qquad (4)$$

$$K_i = (X_i + P_i) W^K_i \qquad (5)$$

$$V_i = (X_i + P_i) W^V_i \qquad (6)$$

where $G$ is the set of gram sizes.
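For illustration, the following PyTorch sketch shows one possible reading of the phrase mechanism in Eqs. (2)-(6); the module name `PhraseContext`, the left-padding of windows, and the additive merge of word and phrase vectors are our illustrative assumptions rather than the exact implementation of PhraseTransformer.

```python
import torch
import torch.nn as nn

class PhraseContext(nn.Module):
    """Illustrative sketch of Eq. (2): run an LSTM over every g-word window
    and keep its final hidden state as the phrase vector for that position.
    The left-padding and per-position alignment are assumptions."""
    def __init__(self, d_model: int, gram_size: int):
        super().__init__()
        self.g = gram_size
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -- the per-head word vectors X_i
        b, n, d = x.shape
        pad = x.new_zeros(b, self.g - 1, d)                       # pad so every word ends a window
        win = torch.cat([pad, x], dim=1).unfold(1, self.g, 1)     # (b, n, d, g)
        win = win.permute(0, 1, 3, 2).reshape(b * n, self.g, d)   # one g-gram per position
        _, (h, _) = self.lstm(win)                                # h: (1, b*n, d)
        return h[-1].view(b, n, d)                                # one phrase vector per word

# Hypothetical integration into one head (Eqs. (3)-(6), additive merge assumed):
#   p = sum(PhraseContext(d, g)(x_i) for g in gram_sizes)   # Eq. (3)
#   q = x_i @ W_q                                            # Eq. (4)
#   k = (x_i + p) @ W_k                                      # Eq. (5)
#   v = (x_i + p) @ W_v                                      # Eq. (6)
```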
After that, these vectors are fed to the Self-Attention layer to learn the dependencies between words and phrases.
$$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i \qquad (7)$$

where $d_k$ is the dimension of one head in the Multi-Head layer. Finally, all head representations are combined to get the final sentence representation $Z$ following Vaswani et al. (2017):
$$H = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O \qquad (8)$$

$$Z' = \mathrm{LayerNorm}(X + H) \qquad (9)$$

$$Z = \mathrm{LayerNorm}\big(Z' + \mathrm{FFN}(Z')\big) \qquad (10)$$
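A minimal sketch of the standard scaled dot-product attention and head combination that Eqs. (7) and (8) follow, as defined in Vaswani et al. (2017); the tensor shapes and names are our assumptions.

```python
import math
import torch

def attention(q, k, v):
    """Scaled dot-product attention, Eq. (7): softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, len_q, len_k)
    return torch.softmax(scores, dim=-1) @ v           # (batch, len_q, d_k)

def combine_heads(heads, w_o):
    """Eq. (8): concatenate the per-head outputs and project with W^O."""
    return torch.cat(heads, dim=-1) @ w_o              # (batch, len, d_model)
```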
This sentence representation is further processed by a stack of identical layers with the same architecture and is then forwarded to the Transformer Decoder (Vaswani et al., 2017) for the decoding process.
Training method
The training objective is to maximize the log-likelihood of generating the target sentence $Y$ given a source sentence $X$ over a machine translation parallel dataset $D$:

$$\mathcal{L}(\theta) = \sum_{(X, Y) \in D} \log P(Y \mid X; \theta) = \sum_{(X, Y) \in D} \sum_{t=1}^{|Y|} \log P(y_t \mid y_{<t}, X; \theta) \qquad (11)$$
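As a concrete reading of Eq. (11), the sketch below computes the (negated) token-level log-likelihood from decoder logits; the padding mask and the averaging over non-pad tokens are assumptions about a typical implementation.

```python
import torch
import torch.nn.functional as F

def neg_log_likelihood(logits: torch.Tensor, target: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Negative log-likelihood of the target tokens (the negation of Eq. (11)),
    averaged over non-padding positions.
    logits: (batch, tgt_len, vocab); target: (batch, tgt_len)."""
    log_probs = F.log_softmax(logits, dim=-1)
    tok_lp = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)  # log P(y_t | y_<t, X)
    mask = target.ne(pad_id).float()
    return -(tok_lp * mask).sum() / mask.sum()
```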
4 Experiments and Results
In this section, we describe our experiments and the corresponding results in detail.
Dataset.
We use the Chinese-Vietnamese parallel corpora provided by the VLSP 2022 workshop for the training and development process. Table 2 shows statistics of this dataset.
| #examples (Train) | #examples (Dev) | #examples (Test) | #vocab (Vi) | #vocab (Zh) | Avg. length (Vi) | Avg. length (Zh) |
|---|---|---|---|---|---|---|
| 300,347 | 999 | 999 | 4K | 16K | 18.99 | 22.85 |
Preprocessing.
We follow previous work (Provilkov et al., 2020) for the preprocessing step. To deal with the out-of-vocabulary problem, we use the BPE encoding method (Sennrich et al., 2016) to split words into sub-tokens, with 4,000 and 16,000 merge operations for Vietnamese (Vi) and Chinese (Zh), respectively. Because there are no spaces between characters in Chinese, the BPE segmentation module directly processes the whole raw sentence as a single word segment.
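The paper applies Sennrich-style BPE; as an illustrative stand-in, the sketch below trains BPE subword models with sentencepiece, where the file names are hypothetical and the vocabulary sizes only mirror the 4K/16K operation counts above (sentencepiece's vocab_size is not exactly the same notion as the number of merge operations).

```python
import sentencepiece as spm

# Hypothetical training files; vocabulary sizes mirror the 4K (Vi) / 16K (Zh) settings.
spm.SentencePieceTrainer.train(
    input="train.vi", model_prefix="bpe_vi",
    vocab_size=4000, model_type="bpe", character_coverage=1.0)
spm.SentencePieceTrainer.train(
    input="train.zh", model_prefix="bpe_zh",
    vocab_size=16000, model_type="bpe", character_coverage=0.9995)

# Segment a Vietnamese sentence into BPE sub-tokens.
sp_vi = spm.SentencePieceProcessor(model_file="bpe_vi.model")
print(sp_vi.encode("Tôi vừa có kế hoạch thay họ biểu diễn ở bữa tiệc.", out_type=str))
```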
Evaluation metric.
We use the BLEU score to evaluate the quality of the translated sentences. For a fair comparison with other works, we used the SacreBLEU tool (Post, 2018, https://github.com/mjpost/sacrebleu). When evaluating translation into Chinese, we used the default word tokenizer provided by this tool via the setting --tok zh. To this end, the output of our machine translation system is compared with the raw sentences of the target language provided by the VLSP 2022 parallel corpora to compute the BLEU score. The signatures of our evaluation settings for the two translation directions are:
- Vietnamese-Chinese: BLEU+case.mixed +lang.vi-zh +numrefs.1 +smooth.exp +tok.zh +version.1.5.1
- Chinese-Vietnamese: BLEU+case.mixed +lang.zh-vi +numrefs.1 +smooth.exp +tok.13a +version.1.5.1
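The signatures above correspond to corpus-level BLEU computed with SacreBLEU; a minimal sketch of how such scores can be obtained programmatically is shown below, with hypothetical hypothesis and reference lists.

```python
import sacrebleu

# Hypothetical system outputs and references (one sentence per entry).
hyps_zh = ["正打算让这群废物在派对上表演呢"]
refs_zh = ["正打算让这群废物在派对上表演呢"]

# Vietnamese -> Chinese: use the Chinese tokenizer, matching +tok.zh above.
bleu_vi2zh = sacrebleu.corpus_bleu(hyps_zh, [refs_zh], tokenize="zh")
print(bleu_vi2zh.score)

# Chinese -> Vietnamese would use the default 13a tokenizer, matching +tok.13a above:
# bleu_zh2vi = sacrebleu.corpus_bleu(hyps_vi, [refs_vi], tokenize="13a")
```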
Experimental setting.
To compare our PhraseTransformer with the original Transformer (Vaswani et al., 2017), we trained and evaluated both models on the Chinese-Vietnamese parallel corpora of VLSP 2022 without any external data or pre-trained model. We conducted all experiments with the same settings on an NVIDIA A40 server. The hidden size is ; the learning rate is initialized to ; the number of learning warm-up steps is ; both the Transformer encoder and decoder contain 6 layers with 4 heads. In the training process, the batch size is 4,096 tokens, and the maximum number of epochs is . Following Provilkov et al. (2020), we average the 5 latest checkpoints (using the script https://github.com/facebookresearch/fairseq/blob/main/scripts/average_checkpoints.py) to obtain the final model for the translation testing process.
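The checkpoint averaging relies on the fairseq script linked above; the sketch below is a standalone illustration of the same idea, averaging parameter tensors across saved state dicts, with hypothetical file names.

```python
import torch

def average_checkpoints(paths):
    """Average the parameter tensors of several saved state dicts,
    mirroring the idea of fairseq's average_checkpoints.py."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# Hypothetical checkpoint file names for the 5 latest epochs.
averaged = average_checkpoints([f"checkpoint{i}.pt" for i in range(96, 101)])
torch.save(averaged, "checkpoint_avg5.pt")
```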
Results.
We report the experimental results of the machine translation systems from Chinese to Vietnamese and from Vietnamese to Chinese in Table 3 and Table 4, respectively (the underlined setting in these tables is the model we submitted at the time of the VLSP 2022 competition). Based on these results, we found that our PhraseTransformer beats the original Transformer under almost all settings of gram sizes. These results demonstrate the effectiveness of our phrase modeling mechanism, which helps the translation system better represent sentence meaning. Besides, our model works well without any external syntax tree information, which makes the PhraseTransformer architecture widely adaptable to many other languages as well as other NLP tasks.
Table 3: BLEU scores of the Chinese-to-Vietnamese translation systems.

| Model | gram sizes ($G$) | Dev | Test |
|---|---|---|---|
| Transformer | - | 30.4 | 31.9 |
| PhraseTransformer | {3} | 31.0 | 33.0 |
| PhraseTransformer | {4} | 30.9 | 32.7 |
| PhraseTransformer | {2,3} | 30.8 | 33.2 |
Table 4: BLEU scores of the Vietnamese-to-Chinese translation systems.

| Model | gram sizes ($G$) | Dev | Test |
|---|---|---|---|
| Transformer | - | 29.5 | 34.2 |
| PhraseTransformer | {3} | 30.1 | 35.0 |
| PhraseTransformer | {4} | 30.1 | 35.3 |
| PhraseTransformer | {2,3} | 30.0 | 34.1 |
Improvement example.
To better understand the improvement of PhraseTransformer over the original Transformer, we examined the differences between the outputs of the two models, shown in Table 5. In the first example, the original Transformer misses part of the information in the translation, "các dịch vụ xã hội cơ bản như thông tin" (basic social services such as information), while our PhraseTransformer recovers it. Besides, in the second example, the output of the original Transformer lacks the fluency of natural language, "đặt mức độ suy giảm khủng hoảng trung bình", which makes the translated sentence misleading ("4 loại dịch bệnh"). We argue that because our PhraseTransformer captures the dependencies between phrases, the model is better at capturing the sub-clauses of a long sentence. In addition, our PhraseTransformer captures the local context within phrases, which helps improve the meaning representation in the translation process.
| Input | 这旨在面向全面保障贫困人口的社会民生权利，提高贫困人口获得信息、就业、医疗、教育、住房、生活用水、卫生、环境等基本社会服务的机会。 |
|---|---|
| Gold sentence | Điều này hướng tới đảm bảo toàn diện quyền an sinh xã hội của người nghèo, cải thiện mức độ tiếp cận của người nghèo đối với các dịch vụ xã hội cơ bản như thông tin, việc làm, y tế, giáo dục, nhà ở, nước sinh hoạt và vệ sinh, môi trường. |
| Transformer | Đây là cơ hội để đảm bảo toàn diện các quyền an sinh xã hội của người nghèo, nâng cao việc làm, y tế, nhà ở, nước sinh hoạt, vệ sinh, môi trường. |
| PhraseTransformer | Đây là dịp để đảm bảo toàn diện các quyền an sinh xã hội cho người nghèo, nâng cao các dịch vụ xã hội cơ bản như thông tin, việc làm, y tế, giáo dục, nhà ở, nước sinh hoạt, vệ sinh, môi trường. |
| Input | 政府将设低危机、中等危机、高危机、最高危机4个疫情级别。 |
| Gold sentence | Chính phủ đã phân loại 4 cấp độ dịch: nguy cơ thấp (bình thường mới), nguy cơ trung bình, nguy cơ cao, nguy cơ rất cao. |
| Transformer | Chính phủ sẽ đặt mức độ suy giảm khủng hoảng trung bình, khủng hoảng cao, nguy cơ cao nhất, 4 loại dịch bệnh. |
| PhraseTransformer | Chính phủ sẽ có 4 cấp độ dịch bệnh thấp, nguy cơ trung bình, nguy cơ cao và khủng hoảng cao nhất. |
5 Conclusion
In this paper, we applied, for the first time, our PhraseTransformer model to the machine translation task for the Vietnamese - Chinese language pair. In this architecture, the original Transformer Encoder is enhanced by incorporating phrase dependency information into the Self-Attention mechanism. Our experimental results showed that our PhraseTransformer beats the original Transformer by a large margin in both translation directions. In future work, we would like to apply our model architecture to other NLP tasks and explore other effective phrase modeling methods to achieve better results.
References
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.
- Bugliarello and Okazaki (2020) Emanuele Bugliarello and Naoaki Okazaki. 2020. Enhancing machine translation with dependency-aware self-attention. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1618–1627, Online. Association for Computational Linguistics.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780.
- Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
- Nguyen et al. (2022) Phuong Nguyen, Tung Le, Thanh-Le Ha, Thai Dang, Khanh Tran, Kim Anh Nguyen, and Nguyen Le Minh. 2022. Improving neural machine translation by efficiently incorporating syntactic templates. In Advances and Trends in Artificial Intelligence. Theory and Practices in Artificial Intelligence, pages 303–314, Cham. Springer International Publishing.
- Nguyen et al. (2023) Phuong Minh Nguyen, Tung Le, Huy Tien Nguyen, Vu Tran, and Minh Le Nguyen. 2023. Phrasetransformer: an incorporation of local context information into sequence-to-sequence semantic parsing. Applied Intelligence, 53(12):15889–15908.
- Nguyen et al. (2020) Xuan-Phi Nguyen, Shafiq Joty, Steven Hoi, and Richard Socher. 2020. Tree-structured attention with hierarchical accumulation. In International Conference on Learning Representations.
- Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels. Association for Computational Linguistics.
- Provilkov et al. (2020) Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita. 2020. BPE-dropout: Simple and effective subword regularization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1882–1892, Online. Association for Computational Linguistics.
- Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
- Shang et al. (2021) Wei Shang, Chong Feng, Tianfu Zhang, and Da Xu. 2021. Guiding neural machine translation with retrieved translation template. 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–7.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc.
- Wang et al. (2019) Yaushian Wang, Hung-Yi Lee, and Yun-Nung Chen. 2019. Tree transformer: Integrating tree structures into self-attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1061–1070, Hong Kong, China. Association for Computational Linguistics.
- Wu et al. (2018) Wei Wu, Houfeng Wang, Tianyu Liu, and Shuming Ma. 2018. Phrase-level self-attention networks for universal sentence encoding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3729–3738, Brussels, Belgium. Association for Computational Linguistics.
- Xu et al. (2020) Hongfei Xu, Josef van Genabith, Deyi Xiong, Qiuhui Liu, and Jingyi Zhang. 2020. Learning source phrase representations for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 386–396, Online. Association for Computational Linguistics.
- Yang et al. (2018) Baosong Yang, Zhaopeng Tu, Derek F. Wong, Fandong Meng, Lidia S. Chao, and Tong Zhang. 2018. Modeling localness for self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4449–4458, Brussels, Belgium. Association for Computational Linguistics.
- Yang et al. (2020) Jian Yang, Shuming Ma, Dongdong Zhang, Zhoujun Li, and Ming Zhou. 2020. Improving neural machine translation with soft template prediction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5979–5989, Online. Association for Computational Linguistics.