FonMTL: Towards Multitask Learning for the Fon Language
Abstract
The Fon language, spoken by approximately 2 million people, is a truly low-resourced African language, with a limited online presence and few existing datasets. Multitask learning is a learning paradigm that aims to improve the generalization capacity of a model by sharing knowledge across different but related tasks, which can be particularly valuable in very data-scarce scenarios. In this paper, we present the first explorative approach to multitask learning for enhancing model capabilities in Natural Language Processing for the Fon language. Specifically, we explore the tasks of Named Entity Recognition (NER) and Part-of-Speech (POS) tagging for Fon. We leverage two language model heads as encoders to build shared representations of the inputs, and we use linear-layer blocks for the classification relative to each task. Our results on the NER and POS tasks for Fon show competitive (or better) performance compared to several multilingual pretrained language models finetuned on single tasks. Additionally, we perform ablation studies comparing the efficiency of two different loss-combination strategies and find that the equal loss-weighting approach works best in our case. Our code is open-sourced at https://github.com/bonaventuredossou/multitask_fon.
1 Introduction
Learning one task at a time can be inefficient and prone to overfitting, low data efficiency, or slow learning (Zhang et al., 2022), especially for large and complex problems. It can also lead to highly specialized models whose generalization capacity is limited to the tasks they were trained on. Multitask Learning (MTL) targets a more efficient way of learning, by sharing common representations of the inputs and by implicitly transferring information among the various tasks (Caruana, 1993, 1997). The efficiency and promise of MTL have been demonstrated in the literature (Gessler and Zeldes, 2022; Ruder, 2017), along with the benefit of understanding the tasks involved (their similarity, relationships, and hierarchy) for MTL.
For low-resourced African languages, building NLP models for specific tasks is generally difficult for several reasons, including the limited availability of datasets (Joshi et al., 2020a; Emezue and Dossou, 2021; Nekoto et al., 2020; Dossou and Emezue, 2020). This is even more true for the Fon language (see Appendix A), a truly low-resourced African language. Therefore, MTL appears to be a promising approach for exploring downstream tasks in Fon.
Part-of-Speech (POS) tagging assigns a grammatical category (noun, adjective, verb, adverb, and so on) to each word in a sentence, using contextual cues to choose the corresponding tag. Named Entity Recognition (NER), on the other hand, determines whether or not a word is part of a named entity (persons, locations, organizations, time expressions, and more). This problem is twofold: detecting names and categorizing them. Consequently, whereas POS tagging is more of a global problem, since there can be relationships between the first and the last word of a sentence, NER is rather local, as named entities are not spread across a sentence and mostly consist of unigrams, bigrams, or trigrams. Although the tasks differ, both are token-level classification tasks, so we speculate that they could benefit from each other.
2 Experiments and Results
In this paper, we explore the first multitask learning model for the Fon language, focusing on the NER and POS tasks. We use a hard parameter-sharing approach, the most popular MTL method in the literature (Caruana, 1993; Ruder, 2017), which falls under the joint training method described by Zhang et al. (2022) in their taxonomy of MTL approaches for NLP.
Experiments:
For NER, we used the Fon portion of MasakhaNER 2.0 (Adelani et al., 2022), a NER dataset covering 20 African languages, including Fon. For the POS task, we used the MasakhaPOS dataset (Dione et al., 2023), the largest POS dataset for 20 typologically diverse African languages, including Fon. MasakhaNER 2.0 provides 4343/621/1240 examples as train/dev/test splits, with a total of 173099 tokens. MasakhaPOS provides 798/159/637 examples as train/dev/test splits, with a total of 49460 tokens and 30.6 tokens per sentence on average. For our experiments, we leveraged two setups (a data-loading sketch follows the list):
- pre-train on all languages of both datasets, and evaluate on the Fon test sets (for NER and POS);
- pre-train only on the Fon data from both datasets, and evaluate on the Fon test sets.
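To make the data setup concrete, below is a minimal loading sketch, not the exact pipeline from our repository. It assumes the two datasets are available on the Hugging Face Hub under the identifiers `masakhane/masakhaner2` and `masakhane/masakhapos` with a `fon` configuration; the identifiers and the column names in the comments are assumptions rather than details stated in this paper.

```python
# Minimal data-loading sketch (Hub IDs, "fon" config name, and column names assumed).
from datasets import load_dataset

ner_fon = load_dataset("masakhane/masakhaner2", "fon")  # token-level NER tags
pos_fon = load_dataset("masakhane/masakhapos", "fon")   # token-level UPOS tags

# Each object is a DatasetDict with train/validation/test splits, e.g.:
print(ner_fon["train"].num_rows, ner_fon["validation"].num_rows, ner_fon["test"].num_rows)
print(pos_fon["train"][0])  # e.g. {"tokens": [...], "upos": [...]} (column names assumed)
```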
In our MTL model (Figure 1), we use two language model (LM) heads: AfroLM-Large (Dossou et al., 2022) and XLMR-Large (Conneau et al., 2020). AfroLM was pretrained from scratch on 23 African languages in an active-learning framework (see Section 3 of Dossou et al. (2022, 2023)), while XLMR-Large was pretrained on 100 languages (5 of which are African). Each LM head builds representations of the inputs for each task, and the two representations are then combined (multiplicatively) into a shared representation across models and tasks. On top of the shared representation, we add two linear layers, each serving as the classification layer for its respective task.
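For concreteness, the following is a hedged sketch of this hard parameter-sharing setup, not our exact implementation. The Hugging Face model identifier for AfroLM, the projection layer that reconciles the two hidden sizes, and the assumption that both encoders receive sequences padded to a common length (glossing over the fact that the two tokenizers segment text differently) are all simplifications introduced here for illustration.

```python
# Hedged sketch of the MTL architecture (Figure 1); model IDs, the hidden-size
# projection, and the shared sequence length are assumptions for illustration.
import torch
import torch.nn as nn
from transformers import AutoModel

class FonMTLModel(nn.Module):
    def __init__(self, num_ner_labels: int, num_pos_labels: int, merge: str = "multiplicative"):
        super().__init__()
        # Two LM heads used as encoders (Hub IDs assumed).
        self.afrolm = AutoModel.from_pretrained("bonadossou/afrolm_active_learning")
        self.xlmr = AutoModel.from_pretrained("xlm-roberta-large")
        hidden = self.xlmr.config.hidden_size
        # Project AfroLM hidden states to XLMR's hidden size so they can be merged.
        self.proj = nn.Linear(self.afrolm.config.hidden_size, hidden)
        self.merge = merge
        # One linear classification block per task.
        self.ner_head = nn.Linear(hidden, num_ner_labels)
        self.pos_head = nn.Linear(hidden, num_pos_labels)

    def forward(self, afrolm_inputs: dict, xlmr_inputs: dict):
        # Both inputs are assumed padded/truncated to the same max length so the
        # token-level hidden states align position-wise.
        h_a = self.proj(self.afrolm(**afrolm_inputs).last_hidden_state)
        h_x = self.xlmr(**xlmr_inputs).last_hidden_state
        # Shared representation across models and tasks (multiplicative by default).
        shared = h_a * h_x if self.merge == "multiplicative" else h_a + h_x
        # Token-level logits for each task.
        return self.ner_head(shared), self.pos_head(shared)
```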

We combined the cross-entropy losses of both tasks in two ways: (a) an unweighted sum (Kurin et al., 2022), i.e., the unitary sum of the individual losses, and (b) an equally weighted sum, where the losses are weighted by two parameters α and β and then added together. For simplicity, in our experiments, we set α = β. We used a small batch size of 4 and the AdamW optimizer (Loshchilov and Hutter, 2017) with a learning rate of 3e-5.
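As an illustration, the snippet below sketches the two loss-combination strategies and the optimizer setup quoted above; the specific values α = β = 0.5 and the use of -100 as the ignored label index for padded positions are assumptions beyond what the text states.

```python
# Sketch of the two loss-combination strategies (alpha/beta values and the
# -100 padding label are assumptions).
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss(ignore_index=-100)  # skip padded / special tokens

def combined_loss(ner_logits, ner_labels, pos_logits, pos_labels,
                  strategy="weighted", alpha=0.5, beta=0.5):
    # Flatten [batch, seq_len, num_labels] logits into token-level predictions.
    l_ner = ce(ner_logits.view(-1, ner_logits.size(-1)), ner_labels.view(-1))
    l_pos = ce(pos_logits.view(-1, pos_logits.size(-1)), pos_labels.view(-1))
    if strategy == "sum":
        return l_ner + l_pos             # (a) unweighted (unitary) sum
    return alpha * l_ner + beta * l_pos  # (b) equally weighted sum (alpha == beta)

# Optimizer setup as described in the text (batch size of 4):
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
```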
Results:
Along with our MTL variants, we evaluated several multilingual pretrained language models (MPLMs) as single-task baselines: AfroXLMR (Alabi et al., 2022) (base and large), AfroLM-Large (Dossou et al., 2022), AfriBerta-Large (Ogueji et al., 2021), and XLMR (Conneau et al., 2020) (base and large). AfriBerta is a multilingual model pretrained on 11 African languages, while AfroXLMR is a version of XLMR adapted to 17 African languages via multilingual adaptive fine-tuning.
Tables 1 and 2 show that our MTL models are competitive with, and in some cases outperform, the single-task baselines on both the NER and POS tasks. This supports our earlier speculation that the data and representations of both tasks are informative and useful for each other's downstream performance. Our results are also backed up by several empirical results (Liu et al., 2019; Crawshaw, 2020; Gong et al., 2019) on various tasks such as Natural Language Understanding (NLU) and the GLUE benchmark. Furthermore, we can see that the equally weighted sum of the two losses worked better than the unweighted sum (Gong et al., 2019; Rengasamy et al., 2020), which makes sense as we treated both tasks equally. It is also important to note that our model performs better (on both tasks) when trained on all languages than when trained solely on the Fon data: this means that Fon benefited from information in the other languages. This could be due to various factors (features) such as geographical distance, phonological distance, and entity overlap, to name but a few (Adelani et al., 2022).
Table 1: NER results (F1-score) on the Fon test set. ∗: trained on the data of all languages; +: trained only on the Fon data.

| Models | F1-Score |
|---|---|
| AfroLM-Large (single task; baseline; ∗) | 80.48 |
| AfriBerta-Large (single task; baseline; ∗) | 79.90 |
| XLMR-Base (single task; baseline; ∗) | 81.90 |
| XLMR-Large (single task; baseline; ∗) | 81.60 |
| AfroXLMR-Base (single task; baseline; ∗) | 82.30 |
| AfroXLMR-Large (single task; baseline; ∗) | 82.70 |
| MTL Sum (multi-task; ours; ∗) | 79.87 |
| MTL Weighted (multi-task; ours; ∗) | 81.92 |
| MTL Weighted (multi-task; ours; +) | 64.43 |
Table 2: POS results (accuracy) on the Fon test set. ∗: trained on the data of all languages; +: trained only on the Fon data.

| Models | Accuracy |
|---|---|
| AfroLM-Large (single task; baseline; ∗) | 82.40 |
| AfriBerta-Large (single task; baseline; ∗) | 88.40 |
| XLMR-Base (single task; baseline; ∗) | 90.10 |
| XLMR-Large (single task; baseline; ∗) | 90.20 |
| AfroXLMR-Base (single task; baseline; ∗) | 90.10 |
| AfroXLMR-Large (single task; baseline; ∗) | 90.40 |
| MTL Sum (multi-task; ours; ∗) | 82.45 |
| MTL Weighted (multi-task; ours; ∗) | 89.20 |
| MTL Weighted (multi-task; ours; +) | 80.85 |
Table 3: Comparison of multiplicative and additive merging of the two LM-head representations. ∗: trained on the data of all languages; +: trained only on the Fon data.

| Merging Type | Models | Task | Metric | Metric Value |
|---|---|---|---|---|
| Multiplicative | MTL Weighted (multi-task; ours; ∗) | NER | F1-Score | 81.92 |
| Multiplicative | MTL Weighted (multi-task; ours; +) | NER | F1-Score | 64.43 |
| Multiplicative | MTL Weighted (multi-task; ours; ∗) | POS | Accuracy | 89.20 |
| Multiplicative | MTL Weighted (multi-task; ours; +) | POS | Accuracy | 80.85 |
| Additive | MTL Weighted (multi-task; ours; ∗) | NER | F1-Score | 78.91 |
| Additive | MTL Weighted (multi-task; ours; +) | NER | F1-Score | 60.93 |
| Additive | MTL Weighted (multi-task; ours; ∗) | POS | Accuracy | 86.99 |
| Additive | MTL Weighted (multi-task; ours; +) | POS | Accuracy | 78.25 |
Our last analysis compared the two ways of merging the representations from the two language model heads. The results are presented in Table 3. The multiplicative approach provided better results than the additive one. Our intuition is that multiplication is closer to the scaled dot-product used in attention mechanisms, which captures vectorial relationships well, thus providing a more meaningful unified representation that is useful for the downstream tasks.
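The two merging operators compared in Table 3 amount to an element-wise product versus an element-wise sum of the two encoders' token-level outputs; the tiny snippet below illustrates this on dummy tensors (shapes are arbitrary and chosen only for illustration).

```python
# Illustration of the two merging operators from Table 3 on dummy token-level
# representations of identical shape (shapes arbitrary, for illustration only).
import torch

h_afrolm = torch.randn(4, 128, 1024)  # [batch, seq_len, hidden]
h_xlmr   = torch.randn(4, 128, 1024)

shared_multiplicative = h_afrolm * h_xlmr  # element-wise product (best in Table 3)
shared_additive       = h_afrolm + h_xlmr  # element-wise sum
```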
3 Conclusion
In this paper, we presented the first effort to leverage MTL for NLP downstream tasks in Fon. Our results on the Fon NER and POS tasks showed competitive (and sometimes better) performance from our MTL methods. We hope these results encourage more exploration of MTL in NLP for Fon in particular, and for African languages in general. In future work, we want to explore: (a) other strategies for merging the representations of the two LM heads, and (b) the dynamically weighted average loss (Rengasamy et al., 2020; Gong et al., 2019) for our training loop, which has been empirically demonstrated to be effective. Additionally, as our results showed that training with the data from all languages helped the downstream performance on both tasks, we would like to explore why this is the case, i.e., to answer the question: "given a target downstream-task language L, which other language(s) and their data would be beneficial to L, and how and why?" Our code is open-sourced at https://github.com/bonaventuredossou/multitask_fon.
References
- Adelani et al. (2022) David Adelani, Graham Neubig, Sebastian Ruder, Shruti Rijhwani, Michael Beukman, Chester Palen-Michel, Constantine Lignos, Jesujoba Alabi, Shamsuddeen Muhammad, Peter Nabende, Cheikh M. Bamba Dione, Andiswa Bukula, Rooweither Mabuya, Bonaventure F. P. Dossou, Blessing Sibanda, Happy Buzaaba, Jonathan Mukiibi, Godson Kalipe, Derguene Mbaye, Amelia Taylor, Fatoumata Kabore, Chris Chinenye Emezue, Anuoluwapo Aremu, Perez Ogayo, Catherine Gitau, Edwin Munkoh-Buabeng, Victoire Memdjokam Koagne, Allahsera Auguste Tapo, Tebogo Macucwa, Vukosi Marivate, Mboning Tchiaze Elvis, Tajuddeen Gwadabe, Tosin Adewumi, Orevaoghene Ahia, Joyce Nakatumba-Nabende, Neo Lerato Mokono, Ignatius Ezeani, Chiamaka Chukwuneke, Mofetoluwa Oluwaseun Adeyemi, Gilles Quentin Hacheme, Idris Abdulmumin, Odunayo Ogundepo, Oreen Yousuf, Tatiana Moteu, and Dietrich Klakow. 2022. MasakhaNER 2.0: Africa-centric transfer learning for named entity recognition. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4488–4508, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Alabi et al. (2022) Jesujoba O. Alabi, David Ifeoluwa Adelani, Marius Mosbach, and Dietrich Klakow. 2022. Adapting pre-trained language models to African languages via multilingual adaptive fine-tuning. In Proceedings of the 29th International Conference on Computational Linguistics, pages 4336–4349, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- Capo (1991) Hounkpati B. C. Capo. 1991. A comparative phonology of Gbe. Foris Publications.
- Caruana (1993) R Caruana. 1993. Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning, pages 41–48. Citeseer.
- Caruana (1997) Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.
- Conneau et al. (2020) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
- Crawshaw (2020) Michael Crawshaw. 2020. Multi-task learning with deep neural networks: A survey. arXiv preprint arXiv:2009.09796.
- Dione et al. (2023) Cheikh M. Bamba Dione, David Adelani, Peter Nabende, Jesujoba Alabi, Thapelo Sindane, Happy Buzaaba, Shamsuddeen Hassan Muhammad, Chris Chinenye Emezue, Perez Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye, Jonathan Mukiibi, Blessing Sibanda, Bonaventure F. P. Dossou, Andiswa Bukula, Rooweither Mabuya, Allahsera Auguste Tapo, Edwin Munkoh-Buabeng, Victoire Memdjokam Koagne, Fatoumata Ouoba Kabore, Amelia Taylor, Godson Kalipe, Tebogo Macucwa, Vukosi Marivate, Tajuddeen Gwadabe, Mboning Tchiaze Elvis, Ikechukwu Onyenwe, Gratien Atindogbe, Tolulope Adelani, Idris Akinade, Olanrewaju Samuel, Marien Nahimana, Théogène Musabeyezu, Emile Niyomutabazi, Ester Chimhenga, Kudzai Gotosa, Patrick Mizha, Apelete Agbolo, Seydou Traore, Chinedu Uchechukwu, Aliyu Yusuf, Muhammad Abdullahi, and Dietrich Klakow. 2023. MasakhaPOS: Part-of-speech tagging for typologically diverse African languages.
- Dossou et al. (2022) Bonaventure F. P. Dossou, Atnafu Lambebo Tonja, Oreen Yousuf, Salomey Osei, Abigail Oppong, Iyanuoluwa Shode, Oluwabusayo Olufunke Awoyomi, and Chris Emezue. 2022. AfroLM: A self-active learning-based multilingual pretrained language model for 23 African languages. In Proceedings of The Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP), pages 52–64, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- Dossou and Emezue (2020) Bonaventure FP Dossou and Chris C Emezue. 2020. FFR v1.1: Fon-French neural machine translation. arXiv preprint arXiv:2006.09217.
- Dossou et al. (2023) Bonaventure FP Dossou, Atnafu Lambebo Tonja, Chris Chinenye Emezue, Tobi Olatunji, Naome A Etori, Salomey Osei, Tosin Adewumi, and Sahib Singh. 2023. Adapting pretrained asr models to low-resource clinical speech using epistemic uncertainty-based data selection. arXiv preprint arXiv:2306.02105.
- Eberhard et al. (2020) David M. Eberhard, Gary F. Simons, and Charles D. Fennig (eds.). 2020. Ethnologue: Languages of the world. twenty-third edition.
- Emezue and Dossou (2021) Chris Chinenye Emezue and Bonaventure F. P. Dossou. 2021. MMTAfrica: Multilingual machine translation for African languages. In Proceedings of the Sixth Conference on Machine Translation, pages 398–411, Online. Association for Computational Linguistics.
- Gessler and Zeldes (2022) Luke Gessler and Amir Zeldes. 2022. MicroBERT: Effective training of low-resource monolingual BERTs through parameter reduction and multitask learning. In Proceedings of the The 2nd Workshop on Multi-lingual Representation Learning (MRL), pages 86–99, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- Gong et al. (2019) Ting Gong, Tyler Lee, Cory Stephenson, Venkata Renduchintala, Suchismita Padhy, Anthony Ndirango, Gokce Keskin, and Oguz H. Elibol. 2019. A comparison of loss weighting strategies for multi task learning in deep neural networks. IEEE Access, 7:141627–141632.
- Joshi et al. (2020a) Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020a. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. Association for Computational Linguistics.
- Joshi et al. (2020b) Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020b. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online. Association for Computational Linguistics.
- Kurin et al. (2022) Vitaly Kurin, Alessandro De Palma, Ilya Kostrikov, Shimon Whiteson, and Pawan K Mudigonda. 2022. In defense of the unitary scalarization for deep multi-task learning. In Advances in Neural Information Processing Systems, volume 35, pages 12169–12183. Curran Associates, Inc.
- Lefebvre and Brousseau (2002) Claire Lefebvre and Anne-Marie Brousseau. 2002. A grammar of Fongbe. Mouton de Gruyter.
- Liu et al. (2019) Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504.
- Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- Nekoto et al. (2020) Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Muhammad, Salomon Kabongo Kabenamualu, Salomey Osei, Freshia Sackey, Rubungo Andre Niyongabo, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa Berhe, Mofetoluwa Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Martinus, Kolawole Tajudeen, Kevin Degila, Kelechi Ogueji, Kathleen Siminyu, Julia Kreutzer, Jason Webster, Jamiil Toure Ali, Jade Abbott, Iroro Orife, Ignatius Ezeani, Idris Abdulkadir Dangana, Herman Kamper, Hady Elsahar, Goodness Duru, Ghollah Kioko, Murhabazi Espoir, Elan van Biljon, Daniel Whitenack, Christopher Onyefuluchi, Chris Chinenye Emezue, Bonaventure F. P. Dossou, Blessing Sibanda, Blessing Bassey, Ayodele Olabiyi, Arshath Ramkilowan, Alp Öktem, Adewale Akinfaderin, and Abdallah Bashir. 2020. Participatory research for low-resourced machine translation: A case study in African languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2144–2160, Online. Association for Computational Linguistics.
- Ogueji et al. (2021) Kelechi Ogueji, Yuxin Zhu, and Jimmy Lin. 2021. Small data? No problem! Exploring the viability of pretrained multilingual language models for low-resourced languages. In Proceedings of the 1st Workshop on Multilingual Representation Learning, pages 116–126, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Rengasamy et al. (2020) Divish Rengasamy, Mina Jafari, Benjamin Rothwell, Xin Chen, and Grazziela P. Figueredo. 2020. Deep learning with dynamically weighted loss function for sensor-based prognostics and health management. Sensors, 20(3).
- Ruder (2017) Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.
- Zhang et al. (2022) Zhihan Zhang, Wenhao Yu, Mengxia Yu, Zhichun Guo, and Meng Jiang. 2022. A survey of multi-task learning in natural language processing: Regarding task relatedness and training methods. arXiv preprint arXiv:2204.03508.
Appendix A Linguistic Description of Fon
Fon, also known as Fongbé, is a native language of the Benin Republic. It is spoken by approximately 1.7 million people. Fon belongs to the Gbe group of the Niger-Congo language family. It is a tonal, isolating, and "left-behind" language according to Joshi et al. (2020b), with a Subject-Verb-Object (SVO) word order. Fon has about 53 different dialects spoken throughout Benin (Lefebvre and Brousseau, 2002; Capo, 1991; Eberhard et al., 2020). Its alphabet is based on the Latin alphabet, with the addition of the letters ɔ, ɖ, ɛ, and the digraphs gb, hw, kp, ny, and xw. There are 10 vowel phonemes in Fon: 6 said to be closed [i, u, ĩ, ũ], and 4 said to be open [ɛ, ɔ, a, ã]. There are 22 consonants (m, b, n, ɖ, p, t, d, c, j, k, g, kp, gb, f, v, s, z, x, h, xw, hw, w). Fon has two phonemic tones: high and low. High is realized as rising (low-high) after a consonant. Basic disyllabic words exhibit all four possibilities: high-high, high-low, low-high, and low-low. In longer phonological words, such as verb and noun phrases, a high tone tends to persist until the final syllable. If that syllable has a phonemic low tone, it becomes falling (high-low). Low tones disappear between high tones, but their effect remains as a downstep. Rising tones (low-high) simplify to high after high (without triggering downstep) and to low before high (Lefebvre and Brousseau, 2002; Capo, 1991).