Vec2Gloss: definition modeling leveraging contextualized vectors with Wordnet gloss
Abstract
Contextualized embeddings have proven to be powerful tools in multiple NLP tasks. Nonetheless, challenges regarding their interpretability and capability to represent lexical semantics still remain. In this paper, we propose that the task of definition modeling, which aims to generate a human-readable definition of a word, provides a route to evaluate and understand high-dimensional semantic vectors. We propose a ‘Vec2Gloss’ model, which produces a gloss from the target word’s contextualized embedding. The generated glosses of this study are made possible by the systematic gloss patterns provided by Chinese Wordnet. We devise two dependency indices to measure semantic and contextual dependency, which are used to analyze the generated texts at the gloss and token levels. Our results indicate that the proposed ‘Vec2Gloss’ model opens a new perspective on the lexical-semantic applications of contextualized embeddings.
1 Introduction
The rapid advancement of distributed semantic models has achieved remarkable results, with machine performance on some language-related benchmarks matching or even surpassing that of human non-experts (Maru et al., 2022; Chowdhery et al., 2022). These successes are often attributed to complex pretrained language models (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019; Raffel et al., 2020), broadly referred to as sentence encodings in the literature (Pavlick, 2022). In contrast to traditional distributional semantic models (Lenci, 2018; Boleda, 2020), sentence encodings are trained top-down: processing sentences is the primary goal, and word-level semantics are emergent properties (Pavlick, 2022).
Studies have shown that sentence encodings do capture lexical semantics. While the contextualized embeddings of individual tokens are highly intertwined with sentiment and syntax (Yenicelik et al., 2020), one can still access a wealth of information about word-level lexical semantics by averaging the vectors across contexts and model layers. When configured properly, these emergent lexical representations outperform explicitly trained static word vector models (Vulić et al., 2020). Arguably, these contextualized embeddings are even sense-aware: one can build sense embeddings with which word sense disambiguation is framed as finding the nearest neighbor of the target word in the sense embedding space (Scarlini et al., 2020b). These studies demonstrate that while sentence encodings are not explicitly trained for word-level semantics, they capture the nuances of word usage to a certain degree.
Nonetheless, challenges regarding their interpretability and capability to represent lexical semantics still remain, and many evaluation methods have been proposed. One unique approach is definition modeling, which aims to generate a definition for a given word. This approach is argued to be a more transparent and direct evaluation of a word’s semantic representation (Noraset et al., 2017; Gardner et al., 2022). In light of distributional semantic models, definition modeling can be understood as first encoding the semantic representation into one or a set of vectors, based on which a language model generates the corresponding definition. Past studies have provided abundant model architecture choices with fruitful results. The unique merit of definition modeling, however, is that one can analyze the embeddings in natural language form, i.e., as definitions. Instead of indirectly examining a high-dimensional vector through word analogies and similarities, we can probe (distributional) lexical semantics transparently in natural language.
The follow-up challenge is how to systematically study the generated definitions, especially when they are produced by a model that may or may not capture the nuances of definition language. In this paper, we investigate model-generated definitions by using a relatively standardized gloss language to train a definition generation model. The gloss dataset comes from the Chinese Wordnet (CWN) (Huang et al., 2010; the data are accessible at https://lopentu.github.io/CwnWeb/). Chinese Wordnet differentiates the lexical senses of each word and describes them with a relatively constrained set of glossing rules. We formulate definition modeling as a vector-to-text task. Specifically, inspired by sense embeddings and the sequence-to-sequence architecture of definition modeling (Scarlini et al., 2020b; Mickus et al., 2019), we encode the context-sensitive word sense into a single vector, from which the model learns to decode the gloss sentence. We use human ratings to evaluate the generated definitions and propose two indices to closely examine the contextual and semantic dependencies. With these two indices, we conduct gloss- and token-level analyses of the generated definitions and show that they fairly reflect aspects of lexical semantics.
The overarching goal of this work is to explore the possibility of gloss generation from a single contextualized vector. We propose that a generation model can be trained on the relatively constrained gloss patterns of the fine-grained Wordnet glosses. The model’s performance is evaluated by human raters, together with an in-depth analysis of the generated gloss patterns. (The code and the rating material are available at the anonymized repository: https://anonymous.4open.science/r/vec4gloss-F2C8/)
2 Related Work
2.1 Patterns in gloss languages
Dictionary definitions, or word glosses, are “language about language”, or “metalanguage” (Sinclair, 1991; Johnson and Johnson, 1998; Hanks, 2013). One of the popular metalanguage theories is the Natural Semantic Metalanguage (NSM) (Wierzbicka, 1972), which proposes that universal semantic primitives can account for the meanings of words. For example, Durst (2004) identifies the features of these primitives as “indefinability”, “indispensability”, “universality”, and “combinability”. On the other hand, Barque and Polguère (2004) have observed that sense descriptions can be categorized into “word paraphrases” and “word interpretations” according to their formal nature (cf. Pottier, 1974 and Pustejovsky, 1998).
While previous studies of metalanguage more frequently adopt a logical or formal semantic approach, the Corpus Pattern Analysis (CPA) proposed by Hanks (2004) offers a new direction for analyzing word glosses from the perspective of syntagmatic patterns. According to Firth (1957), the meanings of a word arise from the context formed by its surrounding words. In the same vein, Hanks (2004) analyzes concordance lines from corpora to generalize the typical patterns of certain words. These groups of words constitute a lexical set united by a semantic type. For example, guns, rifles, and pistols form a lexical set related to the verb fire, under the semantic type of firearms (Hanks, 2013).
While not closely following the CPA methodology, the gloss language in Chinese Wordnet tries to incorporate lexical sets and semantic types into its glosses. For example, one of the gloss patterns for adverbial senses (see the CWN manual, in Chinese, at https://lope.linguistics.ntu.edu.tw/cwn/documentation) takes the form ‘表…的程度’, as in ‘表(超過平常)的程度’ (a sense of 很 hěn ‘very’), which literally translates to ‘describing (exceeding normal) extent.’ The same glossing guidelines are applied across lexical categories. Therefore, the CWN glosses provide fertile ground for systematically modeling the gloss language. However, as the gloss patterns are too complex for logical or formal analyses, definition modeling with deep learning is beneficial for exploring the hidden information underlying these gloss patterns.
2.2 Definition Modeling
Definition modeling generates a definition for a target word (Gardner et al., 2022; Noraset et al., 2017). Noraset et al. (2017) leverage hypernym embeddings to generate dictionary definitions. Since it is inherently difficult to capture the sense of a polysemous word given a single-word input, Gadetsky et al. (2018) incorporate the context words’ embeddings and an attention-based skip-gram model to model the definitions of polysemous words. Recent work further incorporates other architectures to better capture semantic vectors and generate definitions. To obtain the semantic representations of the target word, models in past studies use recurrent neural networks, variational generative models, and other pretrained language models (Ishiwatari et al., 2019; Reid et al., 2020; Zhang et al., 2020). Notably, some studies leverage lexical resources such as HowNet and WordNet to construct latent vectors or use them as guiding signals (Dong and Dong, 2006; Luo et al., 2018a, b; Blevins and Zettlemoyer, 2020; Li et al., 2020; Scarlini et al., 2020a; Yang et al., 2020).
Recently, contextualized embeddings have been shown to capture essential aspects of lexical semantics in the word sense disambiguation literature (Peters et al., 2018; Loureiro and Jorge, 2019). Utilizing sense vectors built from contextualized embeddings, Scarlini et al. (2020b) found that a simple 1-nearest-neighbor algorithm achieves performance comparable to other supervised model architectures in word sense disambiguation. The result demonstrates that the encoded vectors carry significant semantic information, which should be not only applicable to disambiguating polysemous words but also beneficial to definition modeling.
In the following, we propose a Vec2Gloss model that formulates definition modeling as a sequence-to-sequence problem with an encoder-decoder architecture (Mickus et al., 2019). An important distinction, however, is that the goal of Vec2Gloss is to decode the definition out of the encoded vector while simultaneously tuning the encoder for an optimized semantic vector. Therefore, we leverage the pretrained mT5 (Xue et al., 2021) text-to-text architecture but impose a tight bottleneck between the encoder and decoder. Furthermore, as the decoder no longer has access to the full sentential context, the generated gloss cannot depend directly on collocations. Specifically, the decoder can only produce the gloss from the encoded semantic vector and the learned gloss patterns.
3 Vec2Gloss Model
The goal of the Vec2Gloss model is to produce an intelligible gloss from a word’s semantic vector based on Chinese Wordnet. The task is closely related to, yet distinct from, common NLP tasks. Specifically, the model’s objective is more than obtaining an encoder representation that maps a lexical word/sense into a vector: the vector must also be optimized for decoding the gloss. On the other hand, the task is more than a standard autoregressive one, as the generated gloss must be conditioned on a vector rather than on prompts or input sequences. An encoder-decoder architecture might be the closest option, but its standard task is to map between input and output text, and it would be unclear whether the model learns to decode the gloss from the semantic vector or merely translates it from the input text.

To leverage the encoder-decoder architecture and simultaneously ensure the model relies on a semantic vector to decode the gloss, we impose a tight bottleneck between the encoder and decoder (Figure 1). The model’s input is a sentence containing a target word. The input sentence is transformed by the encoder into a set of encoder states. Next, a predefined target mask is applied to the encoder states, and the target word’s encoder vectors are selected. These vectors are then averaged into a single vector and fed into the decoder, which learns to generate the gloss sequence. Notably, instead of mixing the encoder states with cross-attention as in the standard architecture, the decoder here has only one encoder vector to attend to. That is, the decoder cannot access the complete input sentence. The encoder is therefore driven to compress as much information as possible into the target word’s semantic vector $v$. On the other hand, the decoder must learn the regularities of the gloss language instead of relying on potential collocation cues between the word context and the gloss. Taken together, the model simultaneously learns the target word’s semantic vector with the encoder, from which the decoder produces the gloss sequence.
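The following is a minimal sketch of how such a bottleneck could be implemented on top of a Hugging Face-style mT5 checkpoint. The names (`target_mask`, `vec2gloss_step`) are our own illustration rather than the authors’ released code, and details such as how the single vector is exposed to cross-attention may differ in the actual implementation.

```python
import torch
from transformers import MT5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

def vec2gloss_step(input_ids, attention_mask, target_mask, labels):
    """One training step of the bottlenecked encoder-decoder (sketch)."""
    # 1. Encode the full input sentence.
    enc = model.encoder(input_ids=input_ids, attention_mask=attention_mask)
    states = enc.last_hidden_state                          # (batch, seq_len, d_model)
    # 2. Select the target word's token states and average them into one vector.
    mask = target_mask.unsqueeze(-1).float()                # (batch, seq_len, 1)
    sem_vec = (states * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, d_model)
    # 3. The decoder cross-attends to this single vector only.
    bottleneck = BaseModelOutput(last_hidden_state=sem_vec.unsqueeze(1))
    out = model(encoder_outputs=bottleneck, labels=labels)
    # Gradients flow through sem_vec back into the encoder, so encoder and
    # decoder are tuned jointly, as described above.
    return out.loss, sem_vec
```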
To enhance the model’s ability to capture the patterns of gloss sequences, we add a denoising stage prior to training on the vector-to-gloss task. In the denoising stage, a standard encoder-decoder architecture is used, and the model learns to reconstruct corrupted spans in the glosses; the purpose is to pretrain the model to better capture the regularities of the gloss language. Afterward, we impose the bottleneck between the encoder and decoder in the fine-tuning stage. The model takes as input a sentence containing a target word along with a target mask; it must learn the target word’s semantic vector with the encoder and generate the gloss sentence entirely from that vector.
3.1 Denoising stage
We first train the model with a denoising objective to better capture the patterns underlying the gloss language. Following the procedures of previous studies (Lewis et al., 2020), we prepare pairs of examples consisting of corrupted spans as inputs and the dropped-out spans as outputs. Such a denoising objective has been shown to perform well on downstream tasks and to be computationally efficient, since the decoded sequences are shorter (Raffel et al., 2020). A pair of such examples is shown below, with the literal English translations given underneath:
Input:  以文字媒介⟨X⟩出來的訊息。
        (using text medium ⟨X⟩ -out information.)
Target: ⟨X⟩表達⟨Y⟩
        (⟨X⟩ express ⟨Y⟩)
The ⟨X⟩ and ⟨Y⟩ stand for special sentinel tokens, which are unique within an example. The spans are character-based and may not follow word boundaries. The corrupted locations are randomly selected, and the span lengths (in characters) are randomly drawn from a Poisson distribution and clipped to between 1 and 4 (inclusive). If the input sequence is longer than 20 characters, another corrupted span is created with the same parameters. The data are extracted from the word glosses of Chinese Wordnet, yielding 26,118 pairs for the denoising objective.
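A sketch of one way to implement this corruption procedure is given below. It assumes character-level spans and mT5’s `<extra_id_*>` sentinel tokens; `lam` is left as a parameter because the Poisson rate is not reproduced in this excerpt, and the overlap handling is our own simplification.

```python
import numpy as np

def corrupt_gloss(gloss: str, lam: float, rng: np.random.Generator):
    """Build one (input, target) denoising pair from a gloss string (sketch).

    Span lengths ~ Poisson(lam), clipped to [1, 4] characters; a second span
    is corrupted when the gloss is longer than 20 characters.
    """
    n_spans = 2 if len(gloss) > 20 else 1
    starts = sorted(rng.choice(len(gloss), size=n_spans, replace=False).tolist())
    corrupted, target, cursor = [], [], 0
    for i, start in enumerate(starts):
        start = max(start, cursor)                  # keep spans non-overlapping
        length = int(np.clip(rng.poisson(lam), 1, 4))
        sentinel = f"<extra_id_{i}>"                # plays the role of <X>, <Y>
        corrupted.append(gloss[cursor:start] + sentinel)
        target.append(sentinel + gloss[start:start + length])
        cursor = start + length
    corrupted.append(gloss[cursor:])
    return "".join(corrupted), "".join(target)
```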
We use the pretrained mT5 encoder-decoder model (mt5-base; Xue et al., 2021) for the denoising objective. In the denoising stage, no bottleneck is applied. The model parameters are updated with the AdamW optimizer, with β₁ and β₂ set to 0.9 and 0.999, respectively, and weight decay set to 0.01. The learning rate follows a linear schedule, and the batch size is set to 8. The model was trained for 3 epochs, which took 30 minutes on an A5000 GPU. The trained parameters are the starting point of the subsequent fine-tuning stage.
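As a rough illustration of this setup (not the authors’ script), the optimizer and schedule could be configured as follows; the learning rate is taken as an argument because its exact value is not reproduced in this excerpt, and zero warmup steps is our assumption.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def make_optimizer(model, lr, num_training_steps):
    """AdamW with the stated betas/weight decay plus a linear LR schedule (sketch)."""
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=lr, betas=(0.9, 0.999), weight_decay=0.01
    )
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
    )
    return optimizer, scheduler

# e.g. 26,118 denoising pairs, batch size 8, 3 epochs:
# num_training_steps = (26_118 // 8) * 3
```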
3.2 Fine-tuning stage
The fine-tuning stage aims to learn the relationships between the target words embedded in the sentences and their glosses in CWN. In addition to the standard T5 encoder-decoder transformer-based architecture, a tight bottleneck is introduced between the encoder and decoder. Namely, only the target word’s encoder states, which may consist of more than one token, are selected and averaged as the semantic vector, from which the decoder learns to produce a complete gloss sentence.
The training data are extracted from CWN’s sense inventory. A training instance is created for each example sentence of a CWN sense. The instance is composed of a pair of input and target sequences. The input is an example sentence in which the target word is marked with a pair of angular brackets. The target sequence is the gloss, preceded by the part-of-speech tag of the sense and followed by a Chinese full-width period. There are 76,969 instances in the training dataset and 8,553 pairs in the evaluation dataset. A sample instance is shown below:
Input:  她不知道為了什麼事而默默不⟨語⟩。
        (She didn’t ⟨say⟩ a word for some reason.)
Target: VA。透過發聲器官，用語音傳送訊息。
        (VA. Using vocal organs to convey a message with speech.)
The model architecture closely follows the standard T5; thus, the weights trained in the denoising stage are directly applicable. Notably, the target word’s angular brackets are removed during preprocessing and used to produce the target mask. The mask selects the relevant encoder states, which are averaged into the semantic vector for the decoder; the decoder’s cross-attention therefore always receives a single vector as input. At training time, the model is trained as a text-to-text task. At inference time, however, the encoder and decoder can work independently: the encoder can be used to obtain a semantic vector from a given sentence, and this vector can be flexibly transformed before being sent to the decoder for gloss generation.
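A sketch of this preprocessing and of the decoupled inference path is shown below, reusing the pooling logic from the earlier sketch; the bracket characters and the offset-based masking reflect our reading of the description, not the released code.

```python
import torch
from transformers import MT5ForConditionalGeneration, MT5TokenizerFast
from transformers.modeling_outputs import BaseModelOutput

tok = MT5TokenizerFast.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

def strip_brackets(marked: str):
    """Turn '...不⟨語⟩。' into (clean sentence, tokenizer output, target mask)."""
    left, rest = marked.split("⟨")
    target, right = rest.split("⟩")
    clean = left + target + right
    enc = tok(clean, return_offsets_mapping=True, return_tensors="pt")
    lo, hi = len(left), len(left) + len(target)
    # A token belongs to the target if its character span overlaps [lo, hi).
    mask = torch.tensor([[e > lo and s < hi and e > s
                          for s, e in enc["offset_mapping"][0].tolist()]])
    return clean, enc, mask

@torch.no_grad()
def generate_gloss(marked: str, max_length: int = 64) -> str:
    """Encode the target word into a semantic vector, then decode a gloss from it."""
    _, enc, mask = strip_brackets(marked)
    states = model.encoder(input_ids=enc["input_ids"],
                           attention_mask=enc["attention_mask"]).last_hidden_state
    m = mask.unsqueeze(-1).float()
    sem_vec = (states * m).sum(dim=1) / m.sum(dim=1)         # the semantic vector
    bottleneck = BaseModelOutput(last_hidden_state=sem_vec.unsqueeze(1))
    ids = model.generate(encoder_outputs=bottleneck, max_length=max_length)
    return tok.decode(ids[0], skip_special_tokens=True)

# generate_gloss("她不知道為了什麼事而默默不⟨語⟩。")
```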
The training procedure is the same as in the previous stage, except that the number of epochs is 10. Training takes 100 minutes on an A5000 GPU.
3.3 Automatic evaluations
The automatic evaluation of the definition generation is shown in Table 1, which presents the BLEU and METEOR scores for each lexical category. The overall score is .41 for BLEU and .62 for METEOR. Notably, the lowest-scoring category is nouns (N), while the highest is proper names (Nb). The higher score for proper names may be attributed to words used as family names or foreign names in CWN: their definitions are short and thus more easily captured by the model. Such names account for 188 items in the proper-name category. On the other hand, it is less clear how to interpret the automatic metrics for the other categories. The scores only indicate the textual difference between the generated and the reference gloss; at a given score level, the generated gloss might be unintelligible to a human reader or simply a paraphrase of the reference gloss. Therefore, we study the generated glosses further with human evaluations, including a rating experiment, a gloss dependency analysis, and a token dependency analysis.
Table 1: BLEU and METEOR scores by lexical category.

| POS | Count | BLEU | METEOR |
|---|---|---|---|
| N | 2,801 | .35 (.01) | .59 (.01) |
| V | 4,376 | .43 (.01) | .63 (.01) |
| D | 432 | .41 (.02) | .62 (.02) |
| O | 530 | .41 (.02) | .63 (.01) |
| Nb | 414 | .63 (.02) | .74 (.02) |
| All | 8,553 | .41 (.01) | .62 (.01) |
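For readers who want to reproduce scores of this kind, a rough sketch follows; the exact tokenization and METEOR configuration behind Table 1 are not specified here, so sacrebleu’s Chinese tokenizer and character-level METEOR (recent NLTK with pre-tokenized inputs) are assumptions.

```python
import sacrebleu
from nltk.translate.meteor_score import meteor_score  # needs nltk.download("wordnet")

def score_glosses(generated, references):
    """Corpus BLEU (sacrebleu, Chinese tokenizer) and mean sentence METEOR (sketch)."""
    bleu = sacrebleu.corpus_bleu(generated, [references], tokenize="zh")
    meteor = sum(
        meteor_score([list(ref)], list(hyp))      # character-segmented strings
        for hyp, ref in zip(generated, references)
    ) / len(generated)
    return bleu.score / 100.0, meteor             # rescale BLEU to the 0-1 range of Table 1
```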
4 Human Evaluations
4.1 Rating experiment
In this experiment, we rely on human raters to evaluate the quality of the generated definitions, i.e., their semantic interpretability and syntactic well-formedness. The task is designed as a multiple-choice task with only one correct answer. A total of 140 items are provided, each consisting of a definition in Chinese and a list of four candidate words. Materials used for the experiment are derived from two sources: the Academia Sinica Balanced Corpus of Modern Chinese (henceforth ASBC; Huang and Chen, 1998) and CWN.
Among the 140 test items, 40 are new words whose definitions are derived from our Vec2Gloss model (namely V2G:ex vivo). We extract words composed only of Chinese characters, remove proper nouns, and filter out words occurring fewer than 10 times in the corpus. The target words, i.e., the correct answers, are randomly and equally selected from four lexical categories: nouns, verbs, adverbs, and others. The incorrect options of each question are of the same word class and are randomly selected from the same collection of ASBC-derived words. The remaining 100 words are all taken from the evaluation dataset, with proper names excluded. Among these 100 words, 20 use definitions from CWN and 80 use definitions generated by the model (namely V2G:in vivo). The word-class composition is the same for the CWN items, and the target words are randomly selected from the dataset and divided equally across word classes.
Five native Chinese speakers majoring in linguistics were recruited as raters. Their first task was to determine the most suitable word from a set of four options based on the given definition. Next, the raters were asked to evaluate the semantic interpretability of the definition on a five-point acceptability judgment scale. That is, they rated, from one to five, how well the definition explains the word chosen as the correct answer in the previous task. Finally, the raters were asked to evaluate the syntactic well-formedness of the definition, i.e., how readily they accepted it as well-formed according to their internal grammar, again on a five-point acceptability judgment scale. Table 2 shows the evaluation results for these two metrics, where we can see that the proposed Vec2Gloss model achieves decent performance in comparison with the original CWN glosses.
Table 2: Correctness and mean semantic/syntactic ratings by gloss source.

| Source | Correctness | Mean (sem.) | Mean (syn.) |
|---|---|---|---|
| CWN | .95 (.02) | 4.47 (.15) | 4.82 (.10) |
| V2G:in vivo | .88 (.03) | 3.51 (.16) | 4.58 (.09) |
| V2G:ex vivo | .86 (.04) | 2.53 (.22) | 4.51 (.12) |
More detailed results for the vector-generated glosses are shown in Table 3. While the mean syntactic well-formedness scores for both V2G:in vivo and V2G:ex vivo are considerably high across all four lexical categories, the semantic interpretability of V2G:ex vivo scores lower than that of V2G:in vivo. In spite of the inferior interpretability, the multiple-choice task for V2G:ex vivo still achieves correct rates above 80% in every category, similar to the results of V2G:in vivo. Moreover, the semantic scores of nouns are lower than those of the other categories for both sources. To further investigate possible reasons for these results, we conduct a gloss dependency analysis.
Table 3: Human evaluation of V2G:in vivo and V2G:ex vivo glosses by lexical category.

| POS | Corr. (in vivo) | Sem. (in vivo) | Syn. (in vivo) | Corr. (ex vivo) | Sem. (ex vivo) | Syn. (ex vivo) |
|---|---|---|---|---|---|---|
| N | .94 (.04) | 3.18 (.35) | 4.14 (.25) | .86 (.08) | 1.92 (.40) | 4.32 (.34) |
| V | .89 (.06) | 3.63 (.34) | 4.79 (.10) | .86 (.08) | 2.74 (.46) | 4.48 (.27) |
| D | .84 (.06) | 3.75 (.31) | 4.69 (.18) | .84 (.07) | 2.76 (.43) | 4.74 (.16) |
| O | .85 (.06) | 3.47 (.32) | 4.70 (.16) | .86 (.10) | 2.70 (.45) | 4.50 (.20) |
4.2 Gloss dependency analysis
Two indices are computed for each token to represent its reliance on the preceding context and on the semantic vector, respectively. Let $p(w_i \mid c_{<i}, v)$ be the likelihood of gloss token $w_i$ given its preceding context $c_{<i}$ and the original semantic vector $v$. First, this likelihood is compared with $p(w_i \mid v)$, the likelihood obtained when all of the token's preceding context is masked during decoding. If a token is mostly determined by the context alone, masking the context will significantly lower its likelihood, and hence the negative log-likelihood ratio $D_{\mathrm{ctx}}(w_i) = -\log\frac{p(w_i \mid v)}{p(w_i \mid c_{<i}, v)}$ will be larger. Similarly, if a token is primarily driven by the semantic vector, replacing the vector while leaving the preceding context intact will lower the likelihood $p(w_i \mid c_{<i}, v')$ and make the ratio $D_{\mathrm{sem}}(w_i) = -\log\frac{p(w_i \mid c_{<i}, v')}{p(w_i \mid c_{<i}, v)}$ larger. Specifically, the semantic vector $v$ coming from the encoder is replaced with another word's semantic vector $v'$ from the same lexical category. All indices are calculated using the shifted reference glosses of each sense as the decoder inputs.
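One way these indices could be computed is sketched below. The pad-token replacement used to mask the preceding context, and the names $D_{\mathrm{ctx}}$ and $D_{\mathrm{sem}}$, are our notational choices for illustrating the description above rather than the authors’ exact implementation; the bottleneck wrapping follows the earlier sketch.

```python
import torch
import torch.nn.functional as F
from transformers.modeling_outputs import BaseModelOutput

@torch.no_grad()
def token_logp(model, sem_vec, gloss_ids, mask_context=False, swap_vec=None):
    """log p(w_i | ...) for every reference-gloss token (single example, sketch).

    sem_vec: (1, d) semantic vector v; gloss_ids: (1, T) reference gloss ids.
    mask_context=True replaces preceding gloss tokens with <pad> (our reading
    of "preceding contexts masked"); swap_vec substitutes v' for v.
    """
    vec = swap_vec if swap_vec is not None else sem_vec
    enc = BaseModelOutput(last_hidden_state=vec.unsqueeze(1))
    start = model.config.decoder_start_token_id
    shifted = torch.cat([torch.tensor([[start]]), gloss_ids[:, :-1]], dim=1)
    logps = []
    for i in range(gloss_ids.size(1)):
        dec = shifted.clone()
        if mask_context:
            dec[0, 1:i + 1] = model.config.pad_token_id   # hide w_1 ... w_{i-1}
        logits = model(encoder_outputs=enc, decoder_input_ids=dec).logits
        logps.append(F.log_softmax(logits[0, i], dim=-1)[gloss_ids[0, i]])
    return torch.stack(logps)

def dependency_indices(model, sem_vec, other_vec, gloss_ids):
    """Per-token D_ctx and D_sem, i.e. negative log-likelihood ratios."""
    full = token_logp(model, sem_vec, gloss_ids)
    d_ctx = full - token_logp(model, sem_vec, gloss_ids, mask_context=True)
    d_sem = full - token_logp(model, sem_vec, gloss_ids, swap_vec=other_vec)
    return d_ctx, d_sem   # average over tokens for the gloss-level indices
```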
Each token's indices, $D_{\mathrm{ctx}}$ and $D_{\mathrm{sem}}$, are averaged to produce gloss-level indices. The results are shown in Figure 2. We first observe that while the contextual dependency is comparable across the four lexical categories, the semantic-vector dependency indices show more variation. Specifically, the nouns' glosses have the highest semantic dependency scores, followed by verbs, adverbs, and others. These results echo the human ratings, in which the syntactic ratings are similar across all categories while nouns score significantly worse on the semantic ratings. The difference in semantic dependency may indicate that nouns are more likely to be used as nominal predicates: they categorize referents into a class with a holistic set of properties. In contrast, adverbs, which have the lowest semantic dependency score here, only describe things by adding a single property to the characterization of the referent (Baker and Croft, 2017; Bolinger, 1980). If this is the case, adverbs present interesting cases for further study. Although they are relatively unaffected by the semantic-vector manipulation, they still carry semantic content such as manner, means, or instrument into their scopes (Lyons, 1977; Lakoff, 1968). Therefore, in a more fine-grained analysis, we should observe that some tokens are more pertinent to the semantic vector than others.

4.3 Token dependency analysis
Following that, we manually identify chunks (semantic constituents) in the glosses and annotate their semantic types. A chunk, as defined here, is a meaningful element that functions as a semantic-type-carrying unit (cf. Gerdes and Kahane, 2013). We selected 244 adverbs from CWN whose glosses contain the word 事件 shìjiàn ‘event’, as they describe an explicit event structure. Each gloss was first segmented into variable-length chunks, and each chunk was manually tagged with its semantic type. The gloss-initial word is not annotated with a semantic type, as it regularly follows a gloss pattern determined by the lexical category. For example, the glosses of adverbs start with the word 表 biǎo ‘indicate’, as in the following example (the gloss of 接連 jiēlián ‘in a row’).
Gloss:  表/同一事件/在/後述時段/中/持續/發生。
        (To express that the same event continuously happens during the later-mentioned period.)
Annot.: –/Event/Preposition/Time/Preposition/Modifier/Action
There are 905 chunks and 19 unique semantic types annotated in this dataset. Six semantic types (event, action, modifier, pre/post-position, negation, others), each occurring at least 25 times (10% of the gloss count), are selected for further analysis; they account for 59% of the annotated chunks. Figures in the Appendix show their distribution across positions in the glosses.

The token-level indices are computed as in Sec. 4.2. Notably, since the annotated glosses may be associated with multiple example sentences in CWN, we extract the semantic vectors from each sentence and average them to represent the target words. The contextual and semantic-vector dependency scores are computed for each token and averaged by semantic type. The results are shown in Figure 3.
In alignment with the observations of the gloss-level analysis, there are distinctive dependency patterns across semantic types. In particular, the Action category is higher in contextual dependency but relatively low in semantic dependency. While Action words are usually the main verbs in the glosses, their distribution is highly skewed: the three most common action words, 發生 fāshēng ‘occur’, 做 zuò ‘make’, and 進行 jìnxíng ‘undergo’, already account for 50% of all action words. Thus, the contextual dependency scores may reflect the constrained word usage in adverb glosses. In contrast, the Preposition and Negation types are relatively high in semantic-vector dependency. This might result from the fact that prepositions introduce the relating complements, so the decoder needs guidance from the semantic vector to select the exact relation for the gloss. Similarly, negation words are hard to capture through syntagmatic relations in the context (Aina et al., 2019; Ettinger, 2020); thus, the decoder has to rely on additional cues from the semantic vector. Consistently, words that are highly predictable given the adverb glosses, i.e., those of the Event type, are low on both dependency scores. It remains for future work to analyze all the patterns or semantic types of the glosses and investigate the exact roles of contextual and semantic dependencies in the gloss patterns.
5 Conclusion
This paper introduces a definition model called Vec2Gloss. In addition to training a gloss generation model that decodes the gloss directly from a single semantic vector while simultaneously optimizing the encoder, we also examine the generated glosses to gain a deeper understanding of the lexical-semantic information captured by the model. The systematic study of glosses is made possible by the systematic gloss patterns provided by CWN. In the experiments, we observe that nouns and verbs are more complicated in terms of their higher semantic dependency, while adverbs are less so. Furthermore, in the token-level analysis, we find that the model needs additional guidance from the semantic vector for negation words and for selecting proper prepositions. Understanding semantic vectors in high-dimensional space is challenging; the reformulated definition modeling task provides another way to look into distributed semantic vectors. Future research can focus on how systematic gloss patterns help us further understand intricate lexical categories, such as nouns and verbs, and on determining optimal gloss patterns through the application of distributed semantic models.
References
- Aina et al. (2019) Laura Aina, Raffaella Bernardi, and Raquel Fernández. 2019. Negated adjectives and antonyms in distributional semantics: not similar? Italian Journal of Computational Linguistics, 5(1):57–71.
- Baker and Croft (2017) Mark Baker and William Croft. 2017. Lexical categories: Legacy, lacuna, and opportunity for functionalists and formalists. Annual Review of Linguistics, 3(1):179–197.
- Barque and Polguère (2004) Lucie Barque and Alain Polguère. 2004. A definitional metalanguage for explanatory combinatorial lexicography.
- Blevins and Zettlemoyer (2020) Terra Blevins and Luke Zettlemoyer. 2020. Moving down the long tail of word sense disambiguation with gloss informed bi-encoders. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1006–1017, Online. Association for Computational Linguistics.
- Boleda (2020) Gemma Boleda. 2020. Distributional semantics and linguistic theory. Annual Review of Linguistics, 6(1):213–234.
- Bolinger (1980) Dwight Bolinger. 1980. Language - the Loaded Weapon. Pearson Education Limited.
- Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1, pages 4171–4186, Minneapolis, Minnesota.
- Dong and Dong (2006) Zhendong Dong and Qiang Dong. 2006. HowNet and the Computation of Meaning.
- Durst (2004) Uwe Durst. 2004. The natural semantic metalanguage approach to linguistic meaning. Theoretical Linguistics, 29(3).
- Ettinger (2020) Allyson Ettinger. 2020. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8:34–48.
- Firth (1957) John Firth. 1957. A synopsis of linguistic theory, 1930-1955. Studies in Linguistic Analysis, pages 10–32.
- Gadetsky et al. (2018) Artyom Gadetsky, Ilya Yakubovskiy, and Dmitry Vetrov. 2018. Conditional generators of words definitions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 266–271, Melbourne, Australia. Association for Computational Linguistics.
- Gardner et al. (2022) Noah Gardner, Hafiz Khan, and Chih-Cheng Hung. 2022. Definition modeling: literature review and dataset analysis. Applied Computing and Intelligence, 2(1):83–98.
- Gerdes and Kahane (2013) Kim Gerdes and Sylvain Kahane. 2013. Defining dependencies (and constituents). Frontiers in Artificial Intelligence and Applications, 258:1–25.
- Hanks (2004) Patrick Hanks. 2004. Corpus pattern analysis. In Proceedings of the 11th EURALEX International Congress, pages 87–97, Lorient, France. Université de Bretagne-Sud, Faculté des lettres et des sciences humaines.
- Hanks (2013) Patrick Hanks. 2013. Lexical analysis: Norms and exploitations. MIT Press.
- Huang and Chen (1998) Chu-Ren Huang and Keh-jiann Chen. 1998. Academia Sinica Balanced Corpus of Modern Chinese. Technical report, Academia Sinica.
- Huang et al. (2010) Chu-Ren Huang, Shu-Kai Hsieh, Jia-Fei Hong, Yun-Zhu Chen, I-Li Su, Yong-Xiang Chen, and Shen-Wei Huang. 2010. Constructing Chinese Wordnet: Design principles and implementation (in Chinese). Zhong-Guo-Yu-Wen, 24(2):169–186.
- Ishiwatari et al. (2019) Shonosuke Ishiwatari, Hiroaki Hayashi, Naoki Yoshinaga, Graham Neubig, Shoetsu Sato, Masashi Toyoda, and Masaru Kitsuregawa. 2019. Learning to describe unknown phrases with local and global contexts. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3467–3476, Minneapolis, Minnesota. Association for Computational Linguistics.
- Johnson and Johnson (1998) Keith Johnson and Helen Johnson. 1998. Encyclopedic Dictionary of Applied Linguistics: A Handbook for Language Teaching. Blackwell Publishers.
- Lakoff (1968) George Lakoff. 1968. Instrumental adverbs and the concept of deep structure. Foundations of language, pages 4–29.
- Lenci (2018) Alessandro Lenci. 2018. Distributional models of word meaning. Annual Review of Linguistics, 4(1):151–171.
- Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
- Li et al. (2020) Jiahuan Li, Yu Bao, Shujian Huang, Xinyu Dai, and Jiajun Chen. 2020. Explicit semantic decomposition for definition generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 708–717.
- Loureiro and Jorge (2019) Daniel Loureiro and Alípio Jorge. 2019. Language modelling makes sense: Propagating representations through WordNet for full-coverage word sense disambiguation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5682–5691, Florence, Italy. Association for Computational Linguistics.
- Luo et al. (2018a) Fuli Luo, Tianyu Liu, Zexue He, Qiaolin Xia, Zhifang Sui, and Baobao Chang. 2018a. Leveraging gloss knowledge in neural word sense disambiguation by hierarchical co-attention. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1402–1411, Brussels, Belgium. Association for Computational Linguistics.
- Luo et al. (2018b) Fuli Luo, Tianyu Liu, Qiaolin Xia, Baobao Chang, and Zhifang Sui. 2018b. Incorporating glosses into neural word sense disambiguation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2473–2482, Melbourne, Australia. Association for Computational Linguistics.
- Lyons (1977) John Lyons. 1977. Semantics. Cambridge University Press.
- Maru et al. (2022) Marco Maru, Simone Conia, Michele Bevilacqua, and Roberto Navigli. 2022. Nibbling at the hard core of Word Sense Disambiguation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4724–4737, Dublin, Ireland. Association for Computational Linguistics.
- Mickus et al. (2019) Timothee Mickus, Denis Paperno, and Matthieu Constant. 2019. Mark my word: A sequence-to-sequence approach to definition modeling. In Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing, pages 1–11, Turku, Finland. Linköping University Electronic Press.
- Noraset et al. (2017) Thanapon Noraset, Chen Liang, Larry Birnbaum, and Doug Downey. 2017. Definition modeling: Learning to define word embeddings in natural language. In Thirty-First AAAI Conference on Artificial Intelligence.
- Pavlick (2022) Ellie Pavlick. 2022. Semantic structure in deep learning. Annual Review of Linguistics, 8(1):447–471.
- Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
- Pottier (1974) Bernard Louis Pottier. 1974. Linguistique générale: Théorie et description. Klincksieck.
- Pustejovsky (1998) James Pustejovsky. 1998. The Generative Lexicon. MIT Press.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
- Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
- Reid et al. (2020) Machel Reid, Edison Marrese-Taylor, and Yutaka Matsuo. 2020. VCDM: Leveraging Variational bi-encoding and Deep contextualized Word Representations for Improved Definition Modeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6331–6344, Online. Association for Computational Linguistics.
- Scarlini et al. (2020a) Bianca Scarlini, Tommaso Pasini, and Roberto Navigli. 2020a. SensEmBERT: Context-enhanced sense embeddings for multilingual word sense disambiguation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8758–8765.
- Scarlini et al. (2020b) Bianca Scarlini, Tommaso Pasini, and Roberto Navigli. 2020b. With more contexts comes better performance: Contextualized sense embeddings for all-round word sense disambiguation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3528–3539.
- Sinclair (1991) John Sinclair. 1991. Corpus, Concordance, Collocation. Oxford University Press.
- Vulić et al. (2020) Ivan Vulić, Edoardo Maria Ponti, Robert Litschko, Goran Glavaš, and Anna Korhonen. 2020. Probing pretrained language models for lexical semantics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7222–7240, Online. Association for Computational Linguistics.
- Wierzbicka (1972) Anna Wierzbicka. 1972. Semantic Primitives. Athenäum-Verlag.
- Xue et al. (2021) Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.
- Yang et al. (2020) Liner Yang, Cunliang Kong, Yun Chen, Yang Liu, Qinan Fan, and Erhong Yang. 2020. Incorporating sememes into Chinese definition modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:1669–1677.
- Yenicelik et al. (2020) David Yenicelik, Florian Schmidt, and Yannic Kilcher. 2020. How does BERT capture semantics? a closer look at polysemous words. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 156–162, Online. Association for Computational Linguistics.
- Zhang et al. (2020) Haitong Zhang, Yongping Du, Jiaxin Sun, and Qingxiao Li. 2020. Improving interpretability of word embeddings by generating definition and usage. Expert Systems with Applications, 160:113633.
Appendix A Appendix
Table 4 shows some examples of model-generated glosses. Figure 4 and Figure 5 show the statistics of the semantic type annotations described in Section 4.3.
Table 4: Examples of model-generated glosses.

| # | Input | Generated |
|---|---|---|
| 1 | 他⟨還⟩沒開口。 (He hasn’t ⟨yet⟩ spoken.) | Dfa。表事情尚未完成。 (Dfa. Describing the situation not having finished.) |
| 2 | 他還沒⟨開⟩口。 (He hasn’t yet ⟨spoken⟩.) | VC。比喻提出要求。 (VC. Making a request.) |
| 3 | 我⟨開⟩了一個會。 (I ⟨had⟩ a meeting.) | VC。進行會議。 (VC. Holding a meeting.) |
| 4 | 這⟨彰顯⟩出重要的價值。 (This ⟨exemplifies⟩ an important value.) | VJ。顯現出後述事物或特質。 (VJ. Showing the quality of the following situation.) |

