
On the Relation between Quality-Diversity Evaluation and Distribution-Fitting Goal in Text Generation

Jianing Li    Yanyan Lan    Jiafeng Guo    Xueqi Cheng
Abstract

The goal of text generation models is to fit the underlying real probability distribution of text. For performance evaluation, quality and diversity metrics are usually applied. However, it is still not clear to what extent the quality-diversity evaluation can reflect the distribution-fitting goal. In this paper, we try to reveal this relation through a theoretical approach. We prove that under certain conditions, a linear combination of quality and diversity constitutes a divergence metric between the generated distribution and the real distribution. We also show that the commonly used BLEU/Self-BLEU metric pair fails to match any divergence metric, and thus propose CR/NRR as a substitute quality/diversity metric pair.

Text Generation, Evaluation Metric

1 Introduction

Text generation is an essential task for many NLP applications, such as machine writing (Zhang et al., 2017a), machine translation (Bahdanau et al., 2014), image captioning (Rennie et al., 2017) and dialogue systems (Li et al., 2017). Text generation models work by either explicitly modeling the probability distribution of text (Mikolov et al., 2010; Yu et al., 2017), or implicitly learning a generator which maps noise data to text (Zhang et al., 2017b; Chen et al., 2018). Both approaches aim at generating text with the same distribution as the given text data.

To achieve the distribution-fitting goal, divergence metrics are usually applied as the training objective for text generation models; they take the minimal value 0 if and only if the model distribution exactly recovers the real text distribution. Typical choices include the Kullback-Leibler divergence by maximum likelihood estimation (MLE) (Mikolov et al., 2010), and the Jensen-Shannon divergence or Wasserstein distance by adversarial training (Yu et al., 2017; Gulrajani et al., 2017). During evaluation, however, divergence-based metrics fail to distinguish two under-fitting cases from each other: the low-quality case that generates unrealistic text, and the low-diversity case that generates dull and repeated text. As such, quality and diversity metrics are introduced to help model diagnosis, such as BLEU (Papineni et al., 2002) and Self-BLEU (Zhu et al., 2018). High generation quality requires the model to generate realistic samples, i.e. generated samples are free of grammatical or logical errors. High generation diversity requires the model to generate diverse samples, i.e. generated samples are less likely to be duplicates and contain diverse unique patterns.

Despite the popular application of quality-diversity metrics in the evaluation of text generation models (Chen et al., 2018; Lu et al., 2018b; Fedus et al., 2018; Alihosseini et al., 2019), the relationship between such evaluation and the distribution-fitting goal is still not clear. It seems to be a tacit consensus in recent works that a model with both higher quality and higher diversity also better fits the real text distribution (Caccia et al., 2018; Li et al., 2019; d’Autume et al., 2019). However, this assumption is yet to be verified. This is critical since a potential inequivalence may result in misleading evaluation conclusions. In this paper, we try to answer this question under the unconditional text generation setting by a theoretical approach.

To bridge the gap between the distribution-fitting goal and quality-diversity evaluation, we require the optimal solutions from divergence minimization to be consistent with those from quality-diversity maximization. As such, we first give a general definition of quality and diversity. Then, we study a Multi-Objective Programming (MOP) problem which maximizes quality and diversity simultaneously. We prove there exists a family of Pareto-optimal solutions for this MOP problem, i.e. solutions which cannot be outperformed in terms of both quality and diversity. Then we prove that the real distribution belongs to this Pareto-optimal family if and only if quality-diversity metrics are used in pairs under strong restrictions. Under such a condition, a linear combination of quality and diversity constitutes a divergence metric between the generated distribution and the real distribution.

For quality-diversity metrics used in practice, we show that the widely applied BLEU/Self-BLEU metric pair fails to match any divergence metric. This is highlighted by a counter-intuitive observation that real text samples are significantly outperformed by manually constructed models on both BLEU and Self-BLEU. Therefore, we further propose Coverage Rate (CR) and Negative Repetition Rate (NRR) as a substitute based on the above theoretical analysis. Experiments show that CR/NRR act well as quality/diversity metrics respectively, while a linear combination of CR/NRR acts well as a divergence metric.

2 Related Work

To evaluate the performance of text generation models, many evaluation metrics have been designed from different perspectives. Early neural text generation models use Perplexity (PPL) to show how well a language model fits the training data (Mikolov et al., 2010). This is a divergence-based metric, and is still adopted in recent works (Fedus et al., 2018; Lu et al., 2018a; Subramanian et al., 2018). Calculation of PPL may be intractable for implicit models, so other divergence-based metrics are also practical choices, such as Kernel Density Estimation (Zhang et al., 2017b), Word Mover Distance (Lu et al., 2018a), MS-Jaccard (Alihosseini et al., 2019), and Frechet Distance (Semeniuta et al., 2018; Alihosseini et al., 2019; d’Autume et al., 2019). However, divergence metrics provide limited information for model diagnosis, and may not correlate well with task performance (Chen et al., 1998; Fedus et al., 2018). Therefore, the quality and diversity of generated text are further considered as complementary metrics, which are also practical requirements in real applications (Zhang et al., 2018; Hashimoto et al., 2019; Gao et al., 2019).

For quality metrics, the evaluation is closely related to the ground truth distribution. Yu et al. (2017) propose to use Negative Log-Likelihood where the real distribution is known in advance, which measures the average log-probability of generated samples over the real distribution. If the real distribution is not explicitly given, BLEU (Papineni et al., 2002) and ROUGE (Lin & Och, 2004) are usually applied, which measure the $n$-gram overlap between generated samples and a set of reference ground truth samples. For diversity metrics, the evaluation is performed within the model itself. Li et al. (2015) proposed Distinct-$n$ as a diversity metric, which calculates the ratio of unique $n$-grams in generated samples. Zhu et al. (2018) proposed Self-BLEU, which is similar to BLEU but uses generated samples as the reference set.

Early works applied only quality metrics for evaluation, such as SeqGAN (Yu et al., 2017), RankGAN (Lin et al., 2017), and LeakGAN (Guo et al., 2017). However, after observing the quality-diversity tradeoff problem, Zhu et al. (2018) suggest using a hybrid of both quality and diversity metrics, such as BLEU and Self-BLEU. This suggestion is widely adopted by many analytical works (Lu et al., 2018b; Caccia et al., 2018; Semeniuta et al., 2018; Alihosseini et al., 2019), as well as newly proposed methods, such as FM-GAN (Chen et al., 2018), DDR (Li et al., 2019), and ScratchGAN (d’Autume et al., 2019). Despite the prevailing application of quality-diversity evaluation, its relationship with divergence metrics remains unclear, which poses great uncertainty for evaluation conclusions. Our work helps to build a bridge between quality-diversity and divergence, and provides guidance for choosing appropriate quality-diversity metrics.

3 Definition of Quality and Diversity

Currently there is no unified definition for quality and diversity in text generation, which brings great challenges for further theoretical studies. In fact, it is not easy to define a general form of quality and diversity due to various understandings of these two aspects. Thus before moving on to further analysis, we first try to give a general form of quality and diversity in a mathematical view, though it may not be comprehensive enough to cover all possible understandings.

3.1 A General Form of Quality and Diversity

Text data is usually discrete, so we make the following notations. Assume the vocabulary size is $|V|$ and the maximum length is $L$; then the distribution of text data can be described by a categorical distribution of size $N=|V|^{L}$. We denote the real distribution and the generated model distribution as $P(x)=(P_{1},P_{2},\cdots,P_{N})$ and $Q(x)=(Q_{1},Q_{2},\cdots,Q_{N})$, respectively.

In general, the Quality of a text generation model measures how likely the generated text is to be realistic text in human’s view. Since the value of the real probability $P(x)$ can be viewed as reflecting the realistic degree of a text $x$, the expectation of some function over $P(x)$ could be used to quantify quality. For example, in the works of Yu et al. (2017) and Nie et al. (2018), Log-Likelihood (LL) is used as the quality metric, where $LL(Q;P)=\mathbb{E}_{x\sim Q}\log P(x)$. Following this idea, we propose a general form of quality, i.e., $U(Q;P)=\mathbb{E}_{x\sim Q}f_{u}[P(x)]$, where $f_{u}$ is a function over $P(x)$.

Similarly, the Diversity of a text generation model measures how much difference there is among generated texts. From the viewpoint of information, the Shannon-Entropy (SE) of $Q(x)$ can be used as a natural diversity metric, where $SE(Q)=-\mathbb{E}_{x\sim Q}\log Q(x)$. From another viewpoint, a text $x$ should be less likely to be generated again if the diversity is high. This idea has been adopted in biology to evaluate the diversity of a biocoenosis, named the Simpson’s Diversity Index (SDI), where $SDI(Q)=1-\mathbb{E}_{x\sim Q}Q(x)$. Summarizing these two different understandings, we obtain a general form of diversity, i.e. $V(Q)=-\mathbb{E}_{x\sim Q}f_{v}[Q(x)]$.

To this end, we propose a general form of quality and diversity metrics as follows:

$$U(Q)=U(Q;P)=\mathbb{E}_{x\sim Q}f_{u}[P(x)]=\sum_{i=1}^{N}Q_{i}\cdot f(P_{i}),$$
$$V(Q)=-\mathbb{E}_{x\sim Q}f_{v}[Q(x)]=\sum_{i=1}^{N}g(Q_{i}),$$

where $f_{u}(x)$ is denoted as $f(x)$ and $-x\cdot f_{v}(x)$ as $g(x)$.
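As an illustration of this general form (our own sketch, not part of the original derivation), the following minimal NumPy snippet evaluates $U(Q;P)$ and $V(Q)$ on toy categorical distributions, instantiated with the LL/SE choice $f(x)=\log x$ and $g(x)=-x\log x$ discussed above; the toy distributions are arbitrary assumptions.

```python
import numpy as np

def quality(Q, P, f):
    """U(Q;P) = E_{x~Q} f[P(x)] = sum_i Q_i * f(P_i)."""
    return float(np.sum(Q * f(P)))

def diversity(Q, g):
    """V(Q) = -E_{x~Q} f_v[Q(x)] = sum_i g(Q_i), with g(x) = -x * f_v(x)."""
    return float(np.sum(g(Q)))

# LL / SE instantiation: f(x) = log x, g(x) = -x log x.
f_ll = np.log
g_se = lambda x: -x * np.log(np.clip(x, 1e-12, None))  # clip guards against log(0)

P = np.array([0.5, 0.3, 0.2])   # toy "real" distribution
Q = np.array([0.6, 0.3, 0.1])   # toy model distribution
print("LL(Q;P) =", quality(Q, P, f_ll))
print("SE(Q)   =", diversity(Q, g_se))
```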

3.2 The Rationality of Quality and Diversity

To guarantee that $U$ and $V$ are rational quality and diversity metrics, we need to discuss the conditions on $f$ and $g$. Without loss of generality, we first assume that $f$ is differentiable and $g$ is twice differentiable. Further, the following requirements are necessary for rational quality and diversity:

1. Generating more samples with higher real probability yields higher overall quality;

2. Distributing the probability more equally yields higher overall diversity.

Mathematically, these two requirements can be formalized as the following two properties:

1. If $P_{i}>P_{j}$, then for $Q^{\prime}=(Q_{1},\dots,Q_{i}+\epsilon,\dots,Q_{j}-\epsilon,\dots)$, there is $U(Q^{\prime})>U(Q)$ for any $\epsilon\in(0,Q_{j})$.

2. If $Q_{i}\geq Q_{j}$, then for $Q^{\prime}=(Q_{1},\dots,Q_{i}+\epsilon,\dots,Q_{j}-\epsilon,\dots)$, there is $V(Q^{\prime})<V(Q)$ for any $\epsilon\in(0,Q_{j})$.

Then we can obtain the conditions on $f$ and $g$ by the following theorem:

Theorem 1.

The following conditions are both sufficient and necessary to satisfy properties 1-2: For any $x_{1},x_{2}$ s.t. $x_{1}>x_{2}>0$ and $x_{1}+x_{2}\leq 1$, we have $f(x_{1})>f(x_{2})$ and $g^{\prime}(x_{1})<g^{\prime}(x_{2})$.

According to Theorem 1, it is necessary for $f(x)$ to be strictly monotonically increasing and for $g(x)$ to be strictly concave on $x\in(0,\frac{1}{2})$. For simplicity, we only consider the cases where such properties hold for $x\in(0,1)$, and thus obtain a sufficient condition:

1. $f(x)$ is strictly monotonically increasing for $x\in(0,1)$;

2. $g(x)$ is strictly concave for $x\in(0,1)$.

Under this condition, we can see that a model with the highest quality will distribute all its density to the text with the highest real probability, and a model with the highest diversity will be uniform, which is consistent with human understanding.

4 Analysis of Quality-Diversity Evaluation

In this section, we show how and to what extent the quality-diversity evaluation can reflect the distribution-fitting goal. The key idea is to solve the Multi-Objective Programming (MOP) problem which tries to maximize quality and diversity simultaneously. We give the structure of all the Pareto-optima of this MOP problem, which constitute the Pareto-frontier. Then we prove the ground truth distribution lies in this frontier if and only if $f$ and $g$ are paired according to a given rule. Under such a condition, a linear combination of quality and diversity constitutes a divergence metric, which means the quality-diversity evaluation is sufficient to reflect the distribution-fitting goal.

4.1 The MOP Problem

We consider the following MOP problem:

$$\max_{Q}\ (U(Q),V(Q))$$
$$\mathrm{s.t.}\ \sum_{i=1}^{N}Q_{i}=1,\qquad\forall i,\ Q_{i}\geq 0$$

The goal is to maximize both quality and diversity, while keeping $Q$ a legal distribution. The optimal solutions of a MOP problem are called Pareto-optima, which means no other solution can beat them consistently over all objectives.

We define the terminology of Pareto-optimality below:

Definition 1.

For two distributions $Q$ and $Q^{\prime}$, if one of the following conditions is satisfied, we say that $Q$ is dominated by $Q^{\prime}$:

1. $U(Q^{\prime})>U(Q)$ and $V(Q^{\prime})\geq V(Q)$;

2. $U(Q^{\prime})\geq U(Q)$ and $V(Q^{\prime})>V(Q)$.

A solution $Q$ is called a Pareto-optimum if it is not dominated by any $Q^{\prime}$. The set containing all the Pareto-optima is called the Pareto-frontier.

Intuitively, a Pareto-optimum is a solution such that no distribution can achieve both higher quality and higher diversity than it, and all the Pareto-optima constitute the Pareto-frontier. The Pareto-frontier may collapse into one solution which leads to a global optimum, e.g. if $P$ is uniform, the unique optimal solution would be $Q^{*}=P$. However, it is often the case that the objectives in a MOP problem cannot reach their optima consistently, which results in a family of optimal solutions. Therefore, the structure of the Pareto-frontier under a non-uniform $P$ is what we care about.

4.2 The Pareto-frontier

Figure 1: Illustration of the Pareto-frontier of the LL-SE metric pair on a toy categorical distribution, which contains 20 categories whose probabilities are sampled from a uniform distribution and normalized.

We show the structure of the Pareto-frontier by giving the following theorem:

Theorem 2.

For a distribution $Q$, if $P$ is not uniform, then:

(1) The following condition is both sufficient and necessary for $Q$ to be a Pareto-optimum: there exist real values $w\leq 0$ and $b$ such that for any $i=1,\dots,N$, there is

$$Q_{i}=\hat{g}^{\prime-1}[w\cdot f(P_{i})+b],$$

where

$$\hat{g}^{\prime-1}(x)=\begin{cases}g^{\prime-1}(x)&\text{if }x<g^{\prime}(0),\\ 0&\text{if }x\geq g^{\prime}(0).\end{cases}$$

(2) $b$ is correspondent to $w$, i.e. $b$ is fixed once $w$ is fixed. If $f(x)<0$ for all $x\in[0,1]$, then $b$ is strictly monotonically increasing w.r.t. $w$. If $f(x)>0$ for all $x\in[0,1]$, then $b$ is strictly monotonically decreasing w.r.t. $w$.

(3) Denote a Pareto-optimum $Q$ as $Q(w)$; then for any $w_{1}<w_{2}$: if $w_{1},w_{2}\in[B,0]$, there is $Q(w_{1})\neq Q(w_{2})$ and $U(Q(w_{1}))>U(Q(w_{2}))$, $V(Q(w_{1}))<V(Q(w_{2}))$; if $w_{1},w_{2}\in(-\infty,B]$, there is $Q(w_{1})=Q(w_{2})$; where $B=\frac{g^{\prime}(\frac{1}{M})-g^{\prime}(0)}{f(P_{m_{1}})-f(P_{m_{2}})}$, $P_{m_{1}}=\max_{i}P_{i}$, $P_{m_{2}}=\max_{P_{i}\neq P_{m_{1}}}P_{i}$, $M=\#\{i|P_{i}=P_{m_{1}}\}$, and $\#$ denotes the cardinality of a set.

According to Theorem 2, different $w$s lead to different distributions, so we can change $w$ from $0$ to $B$ and get a family of optimal solutions with different quality and diversity. As such, for a non-uniform $P$, the Pareto-frontier is a family of distributions.

We can see that quality and diversity form a tradeoff if we want to maximize them at the same time. Since all distributions in the Pareto-frontier are Pareto-optima, trying to improve one metric for an optimum will at best lead to another optimum, thus inevitably causing the other metric to drop. This result provides support for the quality-diversity tradeoff problem observed in previous works (Zhu et al., 2018; Caccia et al., 2018).

We show the result of Theorem 2 on a special case. We pair Log-Likelihood (LL) with Shannon-Entropy (SE); the corresponding Pareto-optima can be written as

$$Q_{i}=\frac{P_{i}^{\beta}}{Z},\quad Z=\sum_{i=1}^{N}P_{i}^{\beta},\quad\beta\geq 0,$$

where we have $w=-\beta$ and $b=1+\log Z$. These Pareto-optima were formerly used as quality-diversity tradeoff solutions by Li et al. (2019).
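For intuition, the following sketch (our own illustration, not the paper's released code) traces this LL-SE Pareto-frontier on a toy 20-category distribution by sweeping $\beta$, mirroring the setup of Figure 1; the random seed and grid of $\beta$ values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.uniform(size=20)
P /= P.sum()                                   # toy real distribution with 20 categories

ll = lambda Q: np.sum(Q * np.log(P))           # quality: LL(Q;P)
se = lambda Q: -np.sum(Q * np.log(Q + 1e-12))  # diversity: SE(Q)

# Pareto-optima of the LL-SE pair: Q_i proportional to P_i^beta, beta >= 0.
for beta in np.linspace(0.0, 5.0, 11):
    Q = P ** beta
    Q /= Q.sum()
    print(f"beta={beta:.1f}  LL={ll(Q):+.3f}  SE={se(Q):.3f}")
# beta = 0 gives the uniform (maximal-diversity) end of the frontier,
# large beta concentrates Q on argmax P (maximal quality),
# and beta = 1 recovers Q = P, the real distribution.
```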

An illustration of the Pareto-frontier on a toy distribution is shown in Figure 1. We can see that quality and diversity are negatively correlated for solutions in the Pareto-frontier. Note that the ground truth distribution lies exactly on the frontier in this LL-SE case, which can be checked by setting $\beta=1$. We will then show this is the key to the relation between quality-diversity metrics and divergence metrics.

4.3 Relationship with Divergence

To bridge the gap between the distribution-fitting goal and quality-diversity evaluation, it is necessary for the optimal solutions from divergence minimization to be consistent with those from quality-diversity maximization. Since $Q=P$ is the optimal solution with minimum divergence and the above Pareto-frontier is the set of optimal solutions with maximal quality and diversity, we require $Q=P$ to be in the Pareto-frontier. Theoretical results are shown in the following theorem:

Theorem 3.

The following condition is both sufficient and necessary for $Q=P$ to be a Pareto-optimum for any $P$: there exist $w_{0}\leq 0$ and $b_{0}$ such that

$$g(x)=w_{0}\int_{0}^{x}f(u)\mathrm{d}u+b_{0}x.$$

If the above condition is satisfied, then $Q=P$ corresponds to a Pareto-optimum with $w=w_{0}$ and $b=b_{0}$, and it is the only distribution that maximizes $\Psi(Q)=\alpha U(Q)+(1-\alpha)V(Q)$ with $\alpha=\frac{w_{0}}{w_{0}-1}\in[0,1)$, so that $D(P||Q)=\Psi(P)-\Psi(Q)$ becomes a divergence metric.

We find that if quality and diversity metrics are carefully chosen, namely $g$ is the integral of an affine transformation of $f$, we can get a divergence metric by a linear combination of these two metrics.

The LL-SE case satisfies the condition in Theorem 3. Under this special case, there is $\Psi(Q)=\frac{1}{2}\mathrm{LL}(Q)+\frac{1}{2}\mathrm{SE}(Q)$, and

$$D(P||Q)=\frac{1}{2}\sum_{i=1}^{N}Q_{i}\cdot\log\frac{Q_{i}}{P_{i}},$$

which is exactly the Reverse KL divergence if the constant $\frac{1}{2}$ is ignored. This linearly combined divergence metric can be viewed as a tangent line of the Pareto-frontier curve in Figure 1, and the real distribution is the tangent point.
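This identity is easy to verify numerically; the short check below (an illustrative sketch with arbitrary random distributions) confirms that $\Psi(P)-\Psi(Q)$ coincides with half the Reverse KL divergence for the LL-SE pair.

```python
import numpy as np

rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(20))   # toy real distribution (strictly positive entries)
Q = rng.dirichlet(np.ones(20))   # toy model distribution

ll = lambda Q_: np.sum(Q_ * np.log(P))     # quality: LL
se = lambda Q_: -np.sum(Q_ * np.log(Q_))   # diversity: SE
psi = lambda Q_: 0.5 * ll(Q_) + 0.5 * se(Q_)

reverse_kl = np.sum(Q * np.log(Q / P))
print(psi(P) - psi(Q), 0.5 * reverse_kl)   # the two values agree
```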

Since such a condition is also necessary, the real distribution is unlikely to be a Pareto-optimum if we use casually chosen metrics. In that case, some distribution would achieve both higher quality and higher diversity than the ground truth, which is implausible. Therefore, if the condition in Theorem 3 is not satisfied, it would be unlikely to measure the divergence using a combination of quality and diversity.

Now we can conclude that a hybrid quality-diversity evaluation is sufficient to reflect the distribution-fitting goal. However, specific metrics should be chosen carefully, in order to avoid a potential violation of this property. Suppose the property is violated severely, featured by a huge gap between the ground truth distribution and the Pareto-frontier; then a model which perfectly fits the real distribution would be significantly outperformed by another model over both quality and diversity, resulting in misleading conclusions.

Therefore in the next section, we will examine the existence of the gap for quality-diversity metrics used in practice, and provide suggestions on the choice of quality-diversity metrics.

5 Options for Quality-Diversity Metrics

It remains to be examined whether existing quality-diversity metrics are sufficient to reflect the distribution-fitting goal. For metrics satisfying the general form defined in Section 3.1, conclusions can be drawn directly by applying Theorem 3. For example, Log-Likelihood (LL) is widely used as a quality metric, corresponding to NLL-oracle (Yu et al., 2017) and Reverse PPL (Subramanian et al., 2018). As proved above, LL satisfies the condition in Theorem 3 if it is paired with Shannon Entropy (SE). Consequently, it is safe to use LL-SE together as in the work of Alihosseini et al. (2019).

However, for most scenarios with real text data, calculation of the general form of quality-diversity in Section 3.1 is intractable since the ground truth distribution is unknown, including for the LL-SE pair. Practical metrics (e.g. BLEU and Self-BLEU) thus usually fall out of this framework, and Theorem 3 cannot be applied directly. In order to make a judgement on such metrics, we suggest considering the compatibility between divergence and the quality-diversity metric pair. We say a pair of quality-diversity metrics is divergence-compatible if the real distribution is a Pareto-optimum under the MOP problem maximizing both metrics. Such compatibility is a necessary condition for the existence of a corresponding divergence metric which is strictly monotonically decreasing w.r.t. both quality and diversity.

5.1 BLEU and Self-BLEU

BLEU (Papineni et al., 2002) and Self-BLEU (Zhu et al., 2018) are common metrics for quality and diversity evaluation, respectively. Intuitively, BLEU measures the $n$-gram overlap between a candidate set of generated text and a reference set of real text, while Self-BLEU is the average BLEU score of each generated text with the other candidates as reference. A high BLEU score means that $n$-grams in generated text are more likely to appear in real text, so BLEU can be used as a quality metric. Similarly, a high Self-BLEU score means that generated texts are similar to each other in terms of $n$-grams, so Negative Self-BLEU (NSBLEU for short) can be used as a diversity metric.

The expression of BLEU on a candidate set $C$ is:

$$\mathrm{BLEU}=BP\cdot\exp\Big(\frac{1}{M}\sum_{n=1}^{M}\log p_{n}\Big),\qquad p_{n}=\frac{\sum_{c\in C}\sum_{gram_{n}\in c}Count_{clip}(gram_{n})}{\sum_{c^{\prime}\in C}\sum_{gram^{\prime}_{n}\in c^{\prime}}Count(gram^{\prime}_{n})},$$

where $BP$ is the Brevity Penalty which penalizes short sentences, and $M$ denotes the maximum $n$-gram order. $p_{n}$ is a precision term, which measures the proportion of $n$-grams in the candidate set that also appear in the reference set. BLEU is the geometric mean of $p_{n}$ for all $n\leq M$, multiplied by a penalty term.

The expression of BLEU does not seem to satisfy the general form of quality/diversity defined in Section 3.1. However, in a special case the general form is still satisfied, upon which we show symptoms indicating the incompatibility of BLEU-NSBLEU. Assume all texts have length $1$, so that $M=1$ and $BP\equiv 1$. In this case, BLEU contains only one term, i.e. $\mathrm{BLEU}=p_{1}$. Then for a candidate set $C$ and a reference set $R$, the expectations of BLEU and NSBLEU over the generated distribution $Q$ and the real distribution $P$ would be

$$\mathop{\mathbb{E}}\limits_{C\sim Q,R\sim P}\mathrm{BLEU}(C,R)=\sum_{i=1}^{N}Q_{i}\cdot[1-(1-P_{i})^{|R|}],$$
$$\mathop{\mathbb{E}}\limits_{C\sim Q}\mathrm{NSBLEU}(C)=-\sum_{i=1}^{N}Q_{i}\cdot[1-(1-Q_{i})^{|C|-1}].$$

Such expressions satisfy the general form with

$$f(x)=1-(1-x)^{|R|},\qquad g(x)=-x+x\cdot(1-x)^{|C|-1}.$$

The condition in Theorem 3 would be satisfied if and only if $|R|=1$ and $|C|=2$, in which case $f(x)=x$ and $g(x)=-x^{2}$. However, the size of the reference set $|R|$ is usually far more than $1$, in which case the BLEU-NSBLEU metric pair is divergence-incompatible.
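Theorem 3 requires $g^{\prime}(x)$ to be an affine function of $f(x)$ (with non-positive slope). The sketch below checks that affinity numerically for the length-1 case: it passes for $|R|=1,|C|=2$ and fails for larger set sizes. The specific sizes and tolerance used here are illustrative assumptions.

```python
import numpy as np

def f_bleu(x, R):    # expected 1-gram BLEU term: f(x) = 1 - (1 - x)^|R|
    return 1 - (1 - x) ** R

def g_nsbleu(x, C):  # expected NSBLEU term: g(x) = -x + x * (1 - x)^(|C| - 1)
    return -x + x * (1 - x) ** (C - 1)

def g_prime_affine_in_f(f, g, eps=1e-6, tol=1e-8):
    """Check whether g'(x) = w0 * f(x) + b0 on (0,1) (the affinity part of Theorem 3)."""
    xs = np.linspace(0.05, 0.95, 50)
    g_prime = (g(xs + eps) - g(xs - eps)) / (2 * eps)  # numerical derivative
    A = np.stack([f(xs), np.ones_like(xs)], axis=1)
    _, residual, *_ = np.linalg.lstsq(A, g_prime, rcond=None)
    return residual.size > 0 and residual[0] < tol

print(g_prime_affine_in_f(lambda x: f_bleu(x, 1), lambda x: g_nsbleu(x, 2)))    # True
print(g_prime_affine_in_f(lambda x: f_bleu(x, 50), lambda x: g_nsbleu(x, 50)))  # False
```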

Though the above analysis is done on a special case, such results imply a potential incompatibility for general BLEU-NSBLEU metric pairs. We will confirm this incompatibility by an empirical approach in Section 6.

Table 1: Lower bound of QDisc and DRate w.r.t. BLEU-NSBLEU on synthetic data with different $\sigma$ values.

| Metrics | $\sigma=0.5$ QDisc | $\sigma=0.5$ DRate(%) | $\sigma=1.0$ QDisc | $\sigma=1.0$ DRate(%) | $\sigma=2.0$ QDisc | $\sigma=2.0$ DRate(%) |
| BS-1 | 0.01287 | 2.55 | 0.01509 | 3.29 | 0.01063 | 3.15 |
| BS-2 | 0.02384 | 9.41 | 0.01699 | 4.27 | 0.01146 | 1.71 |
| BS-3 | 2.090$\times 10^{-8}$ | <0.01 | 6.045$\times 10^{-6}$ | 0.19 | 3.878$\times 10^{-4}$ | 0.05 |

5.2 The Proposed Metric Pair

To avoid possible misleading conclusions in practice, we suggest using a divergence-compatible quality-diversity metric pair.

Since the real probability $P(x)$ is required in $U(Q;P)$ under the general form in Section 3.1, the calculation of most quality metrics is intractable on real text data. The only exception is the case with $f(x)=x$, paired with $g(x)=-x^{2}$. The linearity of $f$ avoids the explicit form of $P(x)$ by sampling from real data, i.e. $U(Q)=\mathbb{E}_{x\sim P}\,Q(x)$. We name the corresponding quality metric Coverage Rate (CR), and the diversity metric Negative Repetition Rate (NRR). Even so, we observe a large variance while estimating CR and NRR on real text data. This is mainly because of the extremely large text space of size $N=|V|^{L}$. Therefore, estimations of CR/NRR are highly inaccurate in the text space.

We thus suggest calculating CR-NRR in the $n$-gram space rather than in the text space. Derive the $n$-gram distributions $Q_{g}$ and $P_{g}$ from the text distributions $Q$ and $P$, so that

$$\mathrm{CR}_{n}(Q;P)=\sum_{gram_{n}\in S_{n}}Q_{g}(gram_{n})\cdot P_{g}(gram_{n}),$$
$$\mathrm{NRR}_{n}(Q)=-\sum_{gram_{n}\in S_{n}}Q_{g}^{2}(gram_{n}),$$

where $S_{n}$ denotes the set of all possible $n$-grams. In practice, $Q_{g}$ and $P_{g}$ can be estimated by the empirical distribution, i.e. count the number of target $n$-grams and divide by the total number. Note that if calculated with the longest $n$-gram, $n=L$, $\mathrm{CR}_{n}$ and $\mathrm{NRR}_{n}$ exactly recover the original CR and NRR metrics in the text space, and thus can be viewed as a generalized form. In the rest of this paper, we use CR-NRR as a default notation in the $n$-gram space unless explicitly stated.
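In code, the empirical estimation of $\mathrm{CR}_{n}$ and $\mathrm{NRR}_{n}$ amounts to counting $n$-grams; the sketch below (toy tokenized sentences, our own illustration rather than the released evaluation code) makes the single pass over the candidate and reference sets explicit.

```python
from collections import Counter

def ngram_dist(sentences, n):
    """Empirical n-gram distribution from a list of token lists."""
    counts = Counter()
    for s in sentences:
        counts.update(tuple(s[i:i + n]) for i in range(len(s) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def cr_n(candidates, references, n):
    """CR_n(Q;P) = sum_g Q_g(g) * P_g(g), estimated from samples."""
    Qg, Pg = ngram_dist(candidates, n), ngram_dist(references, n)
    return sum(q * Pg.get(g, 0.0) for g, q in Qg.items())

def nrr_n(candidates, n):
    """NRR_n(Q) = -sum_g Q_g(g)^2, estimated from samples."""
    Qg = ngram_dist(candidates, n)
    return -sum(q * q for q in Qg.values())

cands = [["a", "cat", "sat"], ["a", "dog", "sat"]]
refs = [["a", "cat", "sat", "down"], ["the", "dog", "sat"]]
print(cr_n(cands, refs, 2), nrr_n(cands, 2))
```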

In the $n$-gram space, calculation of metric pairs with other $f$/$g$ functions also becomes possible. However, metrics such as LL-SE suffer from a smoothing problem on real text data, i.e. their values go to infinity if some $n$-grams do not appear in the candidate set or reference set. Therefore, we still suggest CR-NRR as the first choice.

Though there is a conversion from the text space to the $n$-gram space, CR/NRR can still reflect quality/diversity. The $\mathrm{CR}_{n}$ metric measures the average probability for an $n$-gram in the candidate set to appear in the reference set, and thus is an indicator of quality. Similarly, $\mathrm{NRR}_{n}$ measures the average probability for an $n$-gram to appear again in two consecutive sampling processes over the candidate set, and thus is an indicator of diversity.

We then check the divergence-compatibility of the CR-NRR evaluation. Firstly, CR-NRR is divergence-compatible w.r.t. distributions in the $n$-gram space, according to Theorem 3. We name the corresponding divergence metric CR-NRR Divergence (CND), where

$$\Psi_{n}(Q)=\frac{2}{3}\mathrm{CR}_{n}(Q;P)+\frac{1}{3}\mathrm{NRR}_{n}(Q),$$

and

$$\mathrm{CND}_{n}(Q;P)=3\cdot[\Psi_{n}(P)-\Psi_{n}(Q)]=\sum_{gram_{n}\in S_{n}}[Q_{g}(gram_{n})-P_{g}(gram_{n})]^{2}.$$

Secondly, CR-NRR is also divergence-compatible w.r.t. distributions in the text space. Assume $Q=P$ is dominated by some $Q^{\prime}$ under the CR-NRR evaluation; then $Q_{g}=P_{g}$ would also be dominated by $Q^{\prime}_{g}$. This contradicts the compatibility in the $n$-gram space, so the compatibility in the text space also holds.

In addition to the divergence-compatibility property, CR-NRR is also easy to acquire. It does not require the explicit value of $P(x)$ or $Q(x)$, and thus can be applied to implicit models just like BLEU-NSBLEU. Moreover, the time complexity of the CR-NRR algorithm is $O(m+n)$, which is much lower than that of BLEU-NSBLEU at $O(m\cdot(m+n))$, where $m$ and $n$ denote the sizes of the candidate and reference sets respectively. To conclude, we suggest using CR-NRR in the $n$-gram space for quality-diversity evaluation, instead of BLEU-NSBLEU.
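The identity $\mathrm{CND}_{n}(Q;P)=3\cdot[\Psi_{n}(P)-\Psi_{n}(Q)]$ can also be verified directly on $n$-gram distributions; the toy dictionaries below are arbitrary and the functions are a minimal sketch, not the evaluation code itself.

```python
def cnd(Qg, Pg):
    """CND_n = sum_g (Q_g(g) - P_g(g))^2 over n-gram distributions given as dicts."""
    grams = set(Qg) | set(Pg)
    return sum((Qg.get(g, 0.0) - Pg.get(g, 0.0)) ** 2 for g in grams)

def psi(Qg, Pg):
    """Psi_n(Q) = (2/3) * CR_n(Q;P) + (1/3) * NRR_n(Q)."""
    cr_ = sum(q * Pg.get(g, 0.0) for g, q in Qg.items())
    nrr_ = -sum(q * q for q in Qg.values())
    return (2.0 / 3.0) * cr_ + (1.0 / 3.0) * nrr_

Pg = {"a": 0.5, "b": 0.3, "c": 0.2}   # toy n-gram distributions
Qg = {"a": 0.6, "b": 0.4}
print(cnd(Qg, Pg), 3 * (psi(Pg, Pg) - psi(Qg, Pg)))   # both print the same value
```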

6 Experiments

In this section, we perform a compatibility analysis of BLEU-NSBLEU, compared with CR-NRR, on both synthetic data and real text data. We show that BLEU-NSBLEU is significantly divergence-incompatible, by observing that ground truth text data are clearly outperformed on both BLEU and NSBLEU by some manually constructed model. We also show that CR/NRR are representative for quality/diversity evaluation respectively, while CND is representative for divergence evaluation.

To measure the degree of incompatibility, we calculate the Quality Discrepancy (QDisc) and Discrepancy Rate (DRate):

$$\mathrm{QDisc}=\max_{Q}U(Q)-U(P),\quad\mathrm{s.t.}\ V(Q)\geq V(P),$$
$$\mathrm{DRate}=\frac{\mathrm{QDisc}}{\max_{Q}U(Q)-U(Q^{\prime})},\quad Q^{\prime}=\mathop{\mathrm{argmax}}\limits_{Q}V(Q).$$

Intuitively, we try to find the model with the best quality whose diversity is no lower than that of the real distribution. Then QDisc measures the difference between this model and the real distribution in terms of quality. DRate measures the ratio between QDisc and the total range of quality over all Pareto-optima. A metric pair is divergence-compatible if and only if $\mathrm{QDisc}=0$.

6.1 Experiments on Synthetic Data

Figure 2: Evaluation of BLEU-NSBLEU and CR-NRR on real text data ((a) MSCOCO dataset; (b) WMT dataset). Test data are random text from the reference set, mixed with noise in proportions $\epsilon=[0.0,0.2,0.4,0.6]$ from right to left.
Table 2: Estimation of QDisc, DRate, Self-Ratio, and Ref-Ratio on real text data.

| Metrics | MSCOCO QDisc | MSCOCO DRate(%) | MSCOCO Self-Ratio | MSCOCO Ref-Ratio | WMT QDisc | WMT DRate(%) | WMT Self-Ratio | WMT Ref-Ratio |
| BS-2 | 0.032 | 3.2 | 0.034 | 0.314 | 0.034 | 3.4 | 0.036 | 0.26 |
| BS-3 | 0.090 | 9.0 | 0.104 | 0.814 | 0.117 | 11.7 | 0.145 | 0.88 |
| BS-4 | 0.162 | 16.2 | 0.219 | 1.46 | 0.211 | 21.1 | 0.339 | 1.59 |
| CN-2 | 0.75$\times 10^{-6}$ | 0.013 | 0.0005 | 0.006 | 3.69$\times 10^{-7}$ | 0.016 | 0.0008 | 0.025 |
| CN-3 | 1.07$\times 10^{-6}$ | 0.079 | 0.0063 | 0.087 | 3.45$\times 10^{-7}$ | 0.098 | 0.0109 | 0.358 |
| CN-4 | 1.15$\times 10^{-6}$ | 0.163 | 0.0247 | 0.421 | 3.12$\times 10^{-7}$ | 0.220 | 0.0525 | 2.092 |

We first run experiments on synthetic data rather than real text data, in order to get the precise values of all metrics. Under this setting, the generated distribution $Q$ and the real distribution $P$ are explicitly given in advance, thus eliminating the possible variance from sampling. The synthetic data are texts of length $L$ over a pseudo vocabulary $V$. We construct the real distribution using an oracle LSTM model as in SeqGAN (Yu et al., 2017), whose weights are randomly sampled from a Gaussian distribution with $\mu=0$. Different standard deviations $\sigma$ are applied to get several synthetic real distributions with different levels of entropy: a distribution with smaller $\sigma$ is flatter and of higher entropy, while a distribution with larger $\sigma$ is sharper and of lower entropy.

Calculation of QDisc and DRate can be achieved by a simple binary-search algorithm if the exact form of the Pareto-frontier is known. However, for the BLEU-NSBLEU metric pair, the frontier is unknown since Theorem 2 cannot be applied in this case. Consequently, we opt to use an optimization-based method for the estimation of QDisc. We try to solve the following optimization problem using stochastic gradient descent (SGD) with momentum:

$$Q^{*}=\mathop{\mathrm{argmax}}\limits_{Q}\ U(Q)-\lambda\cdot\max(0,V(P)-V(Q)),$$

where $\lambda$ is a penalty coefficient to discourage the case where diversity is lower than that of the real distribution $P$. We set $\lambda=2.0$ in our experiments. Then $\mathrm{QDisc}=U(Q^{*})-U(P)$, and the denominator in DRate is also calculated through this optimization-based method.
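To make the estimation procedure concrete, here is a simplified sketch of the penalized objective for the differentiable CR-NRR case ($U(Q)=\sum_i Q_iP_i$, $V(Q)=-\sum_i Q_i^2$) with a softmax-parameterized $Q$ and plain gradient ascent; the toy distribution, learning rate, and step count are illustrative assumptions, while the experiments in this section use SGD with momentum on the BLEU-NSBLEU objective under the TensorFlow framework.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(32))          # toy "real" distribution

U = lambda Q: np.sum(Q * P)             # CR-style quality, f(x) = x
V = lambda Q: -np.sum(Q ** 2)           # NRR-style diversity, g(x) = -x^2

lam, lr, steps = 2.0, 2.0, 20000
theta = np.zeros_like(P)                # Q = softmax(theta)
for _ in range(steps):
    Q = np.exp(theta - theta.max()); Q /= Q.sum()
    # gradient of U(Q) - lam * max(0, V(P) - V(Q)) w.r.t. Q
    grad_Q = P + (lam * (-2.0 * Q) if V(P) > V(Q) else 0.0)
    # chain rule through the softmax parameterization
    grad_theta = Q * (grad_Q - np.sum(grad_Q * Q))
    theta += lr * grad_theta

Q_star = np.exp(theta - theta.max()); Q_star /= Q_star.sum()
# For a divergence-compatible pair such as CR-NRR this bound should stay near zero.
print("QDisc lower bound:", U(Q_star) - U(P))
```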

For the BLEU metric with candidate set size $m$ and reference set size $n$, the expectation can be directly calculated by

$$\mathop{\mathbb{E}}\limits_{C\sim Q,R\sim P}\mathrm{BLEU}(C,R)=\sum\limits_{C\in V^{L\cdot m},R\in V^{L\cdot n}}\ \prod_{i=1}^{m}Q(C_{i})\cdot\prod_{j=1}^{n}P(R_{j})\cdot\mathrm{BLEU}(C,R).$$

The time complexity (number of terms) of this calculation is $O(|V|^{L\cdot(m+n)})$, which is intolerable for the above optimization problem even in a text space of normal size. As a result, we set $|V|=4, L=3, m=1, n=2$, and apply SGD under the Tensorflow framework (a slight increase of any parameter would consume intolerably more time, and is not necessary for the conclusions).

We use CN-$n$ and BS-$n$ as abbreviations for CR-NRR and BLEU-NSBLEU with $n$-grams, respectively. We report the QDisc and DRate of BLEU-NSBLEU in Table 1. Note that the reported QDisc values are lower bounds, since the optimization-based method does not guarantee a global optimum. These non-zero QDisc values provide clear support for the incompatibility of BLEU-NSBLEU. We can also see that such discrepancy is significant in some cases, e.g. $\mathrm{QDisc}>0.02$ and $\mathrm{DRate}=9.41\%$ for BS-2 on data with $\sigma=0.5$. A QDisc value of 0.02 means that we cannot surely claim that a model is better than another when the quality gap is below 0.02, which is already a clear gap for BLEU. We also run similar experiments for CR-NRR. However, no positive lower bound is observed, which is in accordance with our theory.

6.2 Experiments on Real Text Data

The significance of the quality discrepancy varies across cases, so we care about the discrepancies on real text data. We use two public datasets, the MSCOCO Image Caption dataset (Chen et al., 2015) and the EMNLP2017 WMT News dataset (http://statmt.org/wmt17/translation-task.html). We use 50,000 sentences as the candidate set and another 50,000 as the reference set for each dataset (see the supplementary material for detailed configurations).

To provide an estimation of QDisc and DRate, we manually construct a family of strong models. We mix the empirical distribution $\tilde{P}$ with a truncated uniform distribution $M$ under different proportions, i.e. $Q=(1-\epsilon)\cdot\tilde{P}+\epsilon\cdot M$. During text generation, a random text from the reference set is sampled with probability $1-\epsilon$; otherwise, a text with random tokens of length $L^{\prime}$ is constructed with probability $\epsilon$. We try both $L^{\prime}=5$ and $L^{\prime}=L$, and report the case with the larger QDisc value.
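A minimal sketch of sampling from this mixture (hypothetical token lists and vocabulary; the actual experiments use the 50,000-sentence reference sets described above) looks as follows.

```python
import random

def sample_mixture(reference_sentences, vocab, eps, noise_len, n_samples, seed=0):
    """Draw from Q = (1 - eps) * empirical(reference) + eps * uniform noise."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        if rng.random() < eps:
            # noise component: a sentence of noise_len uniformly random tokens
            samples.append([rng.choice(vocab) for _ in range(noise_len)])
        else:
            # empirical component: copy a random reference sentence
            samples.append(list(rng.choice(reference_sentences)))
    return samples

refs = [["a", "cat", "sat"], ["the", "dog", "ran"]]
vocab = ["a", "the", "cat", "dog", "sat", "ran"]
print(sample_mixture(refs, vocab, eps=0.4, noise_len=5, n_samples=4))
```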

We estimate QDisc by a linear interpolation between the two closest points on the curve w.r.t. the quality of real data. For the denominator of DRate in BLEU-NSBLEU, we use $1.0$ directly, since $\mathrm{BLEU}=1$ is reached for the highest quality with $\epsilon=0.0$, and $\mathrm{BLEU}\approx 0$ for the highest diversity with $\epsilon=1.0$. For CR-NRR, CR goes to 0 when diversity is maximized with $\epsilon=1.0$. As for the maximal value of CR, we estimate it by using a single reference sentence as candidate and selecting the one with the maximal CR value.

For a clearer view of the significance of the quality discrepancy, we introduce two additional metrics: Self-Ratio and Ref-Ratio. Self-Ratio is the ratio between QDisc and the quality of the candidate set. Ref-Ratio is the ratio between QDisc and the quality difference between $\epsilon=0.0$ and $\epsilon=0.2$. The evaluation results of BLEU-NSBLEU and CR-NRR with $3$-grams under $L^{\prime}=5$ are shown in Figure 2.

We can see that real data stays close to the CR-NRR curve, while a much larger gap is observed between real data and the BLEU-NSBLEU curve. We give the values of QDisc, DRate, Self-Ratio, and Ref-Ratio in Table 2. BLEU-NSBLEU shows a significant incompatibility, with QDisc values ranging from 0.032 to 0.211. Such a huge discrepancy in BLEU is unbearable in real applications, e.g. we cannot claim a model is better than another even if it achieves higher NSBLEU and significantly higher BLEU. As a result, we suggest not using BLEU-NSBLEU, in order to avoid misleading conclusions. CR-NRR also shows a small positive discrepancy; this is due to the inevitable difference between the empirical distributions of the candidate set and reference set. However, the discrepancy caused by such distribution difference is generally much smaller than that of BLEU-NSBLEU. We also observe that DRate grows quickly as the $n$-gram order becomes larger for CR-NRR, thus we suggest using CR-NRR with short $n$-grams such as CN-2 or CN-3.

Figure 3: Evaluation of CR-NRR and CND on real text data ((a) MSCOCO dataset; (b) WMT dataset). Test data are generated by a temperature sweep on pre-trained RNNLMs.

Next we show how CR/NRR/CND behave on real text data. We apply a temperature sweep on an RNN-based language model (RNNLM) pre-trained by maximum likelihood estimation, which is a quick way to get a family of models along the quality-diversity tradeoff according to Caccia et al. (2018). The RNNLM consists of an embedding layer, an LSTM layer, and a fully-connected output layer. The embedding dimension and the number of hidden nodes are both set to 128. We train the model using the Adam (Kingma & Ba, 2014) optimizer with learning rate $0.001$ for 30 epochs. As the temperature $t$ grows, the model becomes closer to uniform, so that quality decreases and diversity increases, and the minimal divergence is taken near $t=1.0$. Results are shown in Figure 3, where we can see CR/NRR/CND are representative for quality/diversity/divergence respectively, which clearly fits our expectations. Therefore, we suggest using CR-NRR for quality-diversity evaluation.
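For completeness, the temperature sweep itself amounts to rescaling the language model's next-token logits before sampling; the snippet below is a generic sketch with made-up logits rather than the trained RNNLM used above.

```python
import numpy as np

def sample_with_temperature(logits, t, rng):
    """Sample a token id from softmax(logits / t); t > 1 flattens, t < 1 sharpens."""
    z = logits / t
    p = np.exp(z - z.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0])            # toy next-token logits
for t in [0.5, 1.0, 2.0]:
    ids = [sample_with_temperature(logits, t, rng) for _ in range(1000)]
    print(t, np.bincount(ids, minlength=4) / 1000)  # larger t -> closer to uniform
```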

7 Discussion

Our above conclusions are mainly drawn under the unconditional text generation setting; however, quality-diversity evaluation is also receiving great attention under conditional text generation settings, such as dialogue systems (Vijayakumar et al., 2016), machine translation (Shen et al., 2019) and image captioning (Ippolito et al., 2019). In this section, we give a brief discussion of quality-diversity evaluation under conditional text generation settings.

Due to the different formalization of quality and diversity metrics, our conclusions cannot be directly transferred to conditional text generation settings. Under these settings, the quality of a text $x$ under condition $c$ is still defined as monotonically increasing w.r.t. the real conditional probability $P(x|c)$, so that the overall quality metric becomes the expectation of text quality over $x$ and $c$, which is the case for BLEU. Meanwhile, diversity metrics have two different understandings. One is defined as the average diversity of the conditional model distribution $Q(x|c)$ under different $c$, such as Pairwise-BLEU (Shen et al., 2019). The other is defined as the diversity of the marginal model distribution $Q(x)=\sum_{c}P(c)Q(x|c)$, such as Distinct (Li et al., 2015). The formalization of both quality and diversity metrics departs from ours in Section 3.1, and may result in different conclusions, thus requiring further separate analysis. Though such analyses are not covered here, our work provides a paradigm for future theoretical analysis, including metric definition, Pareto-optimality analysis, and divergence-compatibility judgement.

Another difference lies in the point of view of the task goal. While the goal of unconditional text generation is to design models that better fit the text distribution, in conditional text generation better human evaluation results are viewed as the final goal in most cases. Therefore, in these cases, the main focus would be designing metrics that better reflect human evaluation, as well as designing training objectives that achieve better evaluation results. It is also an open question whether human evaluation is compatible with divergence. We regard these as our future work.

8 Conclusion

In this paper, we give a theoretical analysis of the relation between quality-diversity evaluation and the distribution-fitting goal. We show that when using properly paired quality-diversity metrics, i.e. $g(x)$ is the integral of an affine transformation of $f(x)$, a linear combination of quality and diversity constitutes a divergence metric between the generated distribution and the real distribution. For metrics used in practice, we show that the commonly used BLEU and Self-BLEU metric pair fails to reflect the distribution-fitting goal. As a substitute, we suggest using CR-NRR as the quality-diversity metric pair.

Acknowledgement

This work was supported by Beijing Academy of Artificial Intelligence (BAAI) under Grants No. BAAI2019ZD0306 and BAAI2020ZJ0303, the National Natural Science Foundation of China (NSFC) under Grants No. 61722211, 61773362, 61872338, 61902381, and 61906180, the Youth Innovation Promotion Association CAS under Grants No. 20144310 and 2016102, the National Key R&D Program of China under Grant No. 2016QY02D0405, and the Lenovo-CAS Joint Lab Youth Scientist Project.

References

  • Alihosseini et al. (2019) Alihosseini, D., Montahaei, E., and Baghshah, M. S. Jointly measuring diversity and quality in text generation models. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pp.  90–98, 2019.
  • Bahdanau et al. (2014) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • Caccia et al. (2018) Caccia, M., Caccia, L., Fedus, W., Larochelle, H., Pineau, J., and Charlin, L. Language gans falling short. arXiv preprint arXiv:1811.02549, 2018.
  • Chen et al. (2018) Chen, L., Dai, S., Tao, C., Zhang, H., Gan, Z., Shen, D., Zhang, Y., Wang, G., Zhang, R., and Carin, L. Adversarial text generation via feature-mover’s distance. In Advances in Neural Information Processing Systems, pp. 4666–4677, 2018.
  • Chen et al. (1998) Chen, S. F., Beeferman, D., and Rosenfeld, R. Evaluation metrics for language models. In DARPA Broadcast News Transcription and Understanding Workshop, pp.  275–280. Citeseer, 1998.
  • Chen et al. (2015) Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollár, P., and Zitnick, C. L. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
  • d’Autume et al. (2019) d’Autume, C. d. M., Rosca, M., Rae, J., and Mohamed, S. Training language gans from scratch. arXiv preprint arXiv:1905.09922, 2019.
  • Fedus et al. (2018) Fedus, W., Goodfellow, I., and Dai, A. M. Maskgan: Better text generation via filling in the _. arXiv preprint arXiv:1801.07736, 2018.
  • Gao et al. (2019) Gao, X., Lee, S., Zhang, Y., Brockett, C., Galley, M., Gao, J., and Dolan, B. Jointly optimizing diversity and relevance in neural response generation. arXiv preprint arXiv:1902.11205, 2019.
  • Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777, 2017.
  • Guo et al. (2017) Guo, J., Lu, S., Cai, H., Zhang, W., Yu, Y., and Wang, J. Long text generation via adversarial training with leaked information. arXiv preprint arXiv:1709.08624, 2017.
  • Hashimoto et al. (2019) Hashimoto, T. B., Zhang, H., and Liang, P. Unifying human and statistical evaluation for natural language generation. arXiv preprint arXiv:1904.02792, 2019.
  • Ippolito et al. (2019) Ippolito, D., Kriz, R., Sedoc, J., Kustikova, M., and Callisonburch, C. Comparison of diverse decoding methods from conditional language models. pp.  3752–3762, 2019.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Li et al. (2015) Li, J., Galley, M., Brockett, C., Gao, J., and Dolan, B. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055, 2015.
  • Li et al. (2017) Li, J., Monroe, W., Shi, T., Jean, S., Ritter, A., and Jurafsky, D. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547, 2017.
  • Li et al. (2019) Li, J., Lan, Y., Guo, J., Xu, J., and Cheng, X. Differentiated distribution recovery for neural text generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.  6682–6689, 2019.
  • Lin & Och (2004) Lin, C.-Y. and Och, F. Looking for a few good metrics: Rouge and its evaluation. In Ntcir Workshop, 2004.
  • Lin et al. (2017) Lin, K., Li, D., He, X., Zhang, Z., and Sun, M.-T. Adversarial ranking for language generation. In Advances in Neural Information Processing Systems, pp. 3155–3165, 2017.
  • Lu et al. (2018a) Lu, S., Yu, L., Zhang, W., and Yu, Y. Cot: Cooperative training for generative modeling of discrete data. arXiv preprint arXiv:1804.03782, 2018a.
  • Lu et al. (2018b) Lu, S., Zhu, Y., Zhang, W., Wang, J., and Yu, Y. Neural text generation: Past, present and beyond. arXiv preprint arXiv:1803.07133, 2018b.
  • Mikolov et al. (2010) Mikolov, T., Karafiát, M., Burget, L., Černockỳ, J., and Khudanpur, S. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
  • Nie et al. (2018) Nie, W., Narodytska, N., and Patel, A. Relgan: Relational generative adversarial networks for text generation. 2018.
  • Papineni et al. (2002) Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp.  311–318. Association for Computational Linguistics, 2002.
  • Rennie et al. (2017) Rennie, S. J., Marcheret, E., Mroueh, Y., Ross, J., and Goel, V. Self-critical sequence training for image captioning. In CVPR, volume 1, pp.  3, 2017.
  • Semeniuta et al. (2018) Semeniuta, S., Severyn, A., and Gelly, S. On accurate evaluation of gans for language generation. arXiv preprint arXiv:1806.04936, 2018.
  • Shen et al. (2019) Shen, T., Ott, M., Auli, M., and Ranzato, M. Mixture models for diverse machine translation: Tricks of the trade. arXiv: Computation and Language, 2019.
  • Subramanian et al. (2018) Subramanian, S., Mudumba, S. R., Sordoni, A., Trischler, A., Courville, A. C., and Pal, C. Towards text generation with adversarially learned neural outlines. In Advances in Neural Information Processing Systems, pp. 7551–7563, 2018.
  • Vijayakumar et al. (2016) Vijayakumar, A. K., Cogswell, M., Selvaraju, R. R., Sun, Q., Lee, S., Crandall, D. J., and Batra, D. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv: Artificial Intelligence, 2016.
  • Yu et al. (2017) Yu, L., Zhang, W., Wang, J., and Yu, Y. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pp.  2852–2858, 2017.
  • Zhang et al. (2018) Zhang, H., Lan, Y., Guo, J., Xu, J., and Cheng, X. Tailored sequence to sequence models to different conversation scenarios. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  1479–1488, 2018.
  • Zhang et al. (2017a) Zhang, J., Feng, Y., Wang, D., Wang, Y., Abel, A., Zhang, S., and Zhang, A. Flexible and creative chinese poetry generation using neural memory. arXiv preprint arXiv:1705.03773, 2017a.
  • Zhang et al. (2017b) Zhang, Y., Gan, Z., Fan, K., Chen, Z., Henao, R., Shen, D., and Carin, L. Adversarial feature matching for text generation. arXiv preprint arXiv:1706.03850, 2017b.
  • Zhu et al. (2018) Zhu, Y., Lu, S., Zheng, L., Guo, J., Zhang, W., Wang, J., and Yu, Y. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp.  1097–1100. ACM, 2018.

Appendix

Appendix A Preliminaries

Before starting the proofs, we first introduce some preliminaries on constrained convex optimization. Assume $f(x)$, $c_{i}(x)$, and $h_{j}(x)$ are continuously differentiable functions defined on $\mathbb{R}^{n}$, and consider the constrained convex optimization problem defined as follows:

$$\min_{x\in\mathbb{R}^{n}}\ f(x)\qquad\mathrm{s.t.}\ c_{i}(x)\leq 0,\ i=1,2,\cdots,k;\quad h_{j}(x)=0,\ j=1,2,\cdots,l\qquad(1)$$

The optimal solutions for the above problem are given by the Lagrange multiplier approach, as shown in the following theorem:

Theorem 4.

Assume $f(x)$ and $c_{i}(x)$ are convex, $h_{j}(x)$ are affine, and the $c_{i}$ are strictly feasible (there exists one $x$ satisfying $c_{i}(x)<0$ for all $i$). Define the Lagrange function as:

$$L(x,\alpha,\beta)=f(x)+\sum_{i=1}^{k}\alpha_{i}c_{i}(x)+\sum_{j=1}^{l}\beta_{j}h_{j}(x),$$

where $\alpha\geq 0$. Then the following conditions are both sufficient and necessary for $x^{*}$ to be a solution of problem 1:

$$\nabla_{x}L(x^{*},\alpha^{*},\beta^{*})=0,\quad\nabla_{\alpha}L(x^{*},\alpha^{*},\beta^{*})=0,\quad\nabla_{\beta}L(x^{*},\alpha^{*},\beta^{*})=0,\qquad(2)$$
$$\alpha_{i}^{*}c_{i}(x^{*})=0,\quad c_{i}(x^{*})\leq 0,\quad\alpha_{i}^{*}\geq 0,\quad i=1,2,\cdots,k,$$
$$h_{j}(x^{*})=0,\quad j=1,2,\cdots,l.$$

The conditions in Equation 2 are called the Karush-Kuhn-Tucker (KKT) conditions.

Appendix B Proof of Theorem 1

For property 1, from $U(Q^{\prime})-U(Q)=\epsilon f(P_{i})-\epsilon f(P_{j})=\epsilon[f(P_{i})-f(P_{j})]>0$, we get $f(P_{i})>f(P_{j})$. We then get the conclusion by setting $x_{1}=P_{i}$ and $x_{2}=P_{j}$.

For property 2, $V(Q^{\prime})-V(Q)=[g(Q_{i}+\epsilon)+g(Q_{j}-\epsilon)]-[g(Q_{i})+g(Q_{j})]<0$ must hold for any $Q_{i}>Q_{j}$. Denote $C=Q_{i}+Q_{j}$ and $r(x)=g(x)+g(C-x)$; then we have $V(Q^{\prime})-V(Q)=r(Q_{i}+\epsilon)-r(Q_{i})<0$ for any $Q_{i},\epsilon$. Since $0<Q_{i}<Q_{i}+\epsilon<1$, we need $r^{\prime}(x)<0$ for $x\in(0,1)$, i.e. $r^{\prime}(x)=g^{\prime}(x)-g^{\prime}(C-x)<0$ must hold for any $0<C-x<x<1$. Set $x_{1}=x$ and $x_{2}=C-x$, and we get $g^{\prime}(x_{1})<g^{\prime}(x_{2})$ for any $x_{1}>x_{2}>0$ with $x_{1}+x_{2}=Q_{i}+Q_{j}\leq 1$.

Appendix C Lemmas

We give two lemmas to support the proofs of Theorem 2 and Theorem 3.

C.1 Lemma 1

Lemma 1.

If $Q$ is a Pareto-optimum, then the following conditions are satisfied: if $P_{i}>P_{j}$, then $Q_{i}\geq Q_{j}$; if $P_{i}=P_{j}$, then $Q_{i}=Q_{j}$.

If $P_{i}>P_{j}$, assume $Q_{i}<Q_{j}$. We can construct $Q^{\prime}$ where $Q^{\prime}_{k}=Q_{k}$ for all $k\neq i,j$ and $Q^{\prime}_{i}=Q_{j},Q^{\prime}_{j}=Q_{i}$. As such, $V(Q^{\prime})=V(Q)$ but $U(Q^{\prime})-U(Q)=(Q_{j}-Q_{i})[f(P_{i})-f(P_{j})]>0$. This means $Q$ is dominated by $Q^{\prime}$, which conflicts with the fact that $Q$ is a Pareto-optimum. So $Q_{i}\geq Q_{j}$.

If $P_{i}=P_{j}$, assume $Q_{i}\neq Q_{j}$; we can further assume $Q_{i}>Q_{j}$. Again we construct $Q^{\prime}$ where $Q^{\prime}_{k}=Q_{k}$ for all $k\neq i,j$ and $Q^{\prime}_{i}=Q^{\prime}_{j}=\frac{Q_{i}+Q_{j}}{2}$. Surely we have $U(Q^{\prime})=U(Q)$, and $V(Q^{\prime})-V(Q)=2g(\frac{Q_{i}+Q_{j}}{2})-g(Q_{i})-g(Q_{j})$. Since $g$ is strictly concave, we have $V(Q^{\prime})-V(Q)>0$, which means $Q$ is dominated by $Q^{\prime}$. This is a contradiction, so $Q_{i}=Q_{j}$.

C.2 Lemma 2

Lemma 2.

Assume $\alpha\in[0,1)$ and $\Psi(Q)=\alpha U(Q)+(1-\alpha)V(Q)$; then the distribution $Q$ that maximizes $\Psi(Q)$ satisfies $Q_{i}=\hat{g}^{\prime-1}[w\cdot f(P_{i})+b]$, with $w=\frac{\alpha}{\alpha-1}$.

Define the optimization problem as follows:

$$\min_{Q}\ -\alpha\cdot U(Q)-(1-\alpha)V(Q)\qquad\mathrm{s.t.}\ 1-\sum_{i=1}^{N}Q_{i}=0;\quad\forall i,\ -Q_{i}\leq 0.$$

Again we first check that the prerequisites of the KKT theorem are all satisfied: $-U(Q)$ is linear and $-V(Q)$ is convex w.r.t. $Q$; $1-\sum_{i=1}^{N}Q_{i}$ is affine w.r.t. $Q$; since all $Q_{i}$ can be positive, the inequalities are all strictly feasible.

The Lagrange function is:

$$L(Q_{i},\lambda,\xi_{i})=-\alpha\sum_{i=1}^{N}Q_{i}f(P_{i})-(1-\alpha)\sum_{i=1}^{N}g(Q_{i})+\lambda\Big(1-\sum_{i=1}^{N}Q_{i}\Big)-\sum_{i=1}^{N}\xi_{i}Q_{i},\quad\xi\geq 0.$$

Applying the KKT conditions, we get the following conditions for an optimal solution:

$$\forall i,\ \frac{\partial L}{\partial Q_{i}}=-\alpha f(P_{i})-(1-\alpha)g^{\prime}(Q_{i})-\lambda-\xi_{i}=0,\qquad\forall i,\ -\xi_{i}Q_{i}=0.$$

For $Q_{i}\neq 0$, there is $\xi_{i}=0$, so

$$Q_{i}=g^{\prime-1}\Big[\frac{\alpha}{\alpha-1}f(P_{i})+\frac{\lambda}{\alpha-1}\Big];$$

for $Q_{i}=0$, there is $\xi_{i}>0$, so

$$\frac{\alpha}{\alpha-1}f(P_{i})+\frac{\lambda}{\alpha-1}>g^{\prime}(0).$$

Denote $w=\frac{\alpha}{\alpha-1}$ and $b=\frac{\lambda}{\alpha-1}$; combining the two cases, we get:

$$Q_{i}=\hat{g}^{\prime-1}[w\cdot f(P_{i})+b],\quad w\leq 0.$$

The above derivation is both sufficient and necessary, which completes the proof.

Appendix D Proof of Theorem 2

We give the proofs for three conclusions individually.

D.1 Conclusion 1

Here we only consider the case with $U(Q)\neq\max_{Q}U(Q)$; the case where $U(Q)=\max_{Q}U(Q)$ will be incorporated into conclusion 3. We try to find a distribution $Q^{\prime}$ with the highest diversity while its quality is not lower than that of $Q$. Define a convex optimization problem as follows:

$$\min_{Q^{\prime}}\ -V(Q^{\prime})\qquad\mathrm{s.t.}\ U(Q)-U(Q^{\prime})\leq 0;\quad 1-\sum_{i=1}^{N}Q^{\prime}_{i}=0;\quad\forall i,\ -Q^{\prime}_{i}\leq 0.$$

For $Q$ to be a Pareto-optimum, it is necessary that $Q'=Q$ be a solution of the above problem. Thus we solve this problem next.

We first check that the prerequisites of the KKT conditions are all satisfied. $-V(Q')$ is convex w.r.t. $Q'$; $1-\sum_{i=1}^{N}Q'_i$ is affine w.r.t. $Q'$; $U(Q)-U(Q')$ and $-Q'_i$ are convex (linear) w.r.t. $Q'$; since all $Q'_i$ can be positive and $U(Q)\neq\max_{Q}U(Q)$, the inequalities are all strictly feasible.

The Lagrange function is:

L(Q^{\prime}_{i},\lambda,\eta,\xi_{i})=-\sum_{i=1}^{N}g(Q^{\prime}_{i})+\lambda\Big(1-\sum_{i=1}^{N}Q^{\prime}_{i}\Big)+\eta\sum_{i=1}^{N}(Q_{i}-Q^{\prime}_{i})f(P_{i})-\sum_{i=1}^{N}\xi_{i}Q^{\prime}_{i},\quad\eta,\xi\geq 0.

Applying KKT, we get the following conditions for an optimal solution:

\forall i,\ \frac{\partial L}{\partial Q^{\prime}_{i}}=-g^{\prime}(Q^{\prime}_{i})-\lambda-\eta f(P_{i})-\xi_{i}=0,
\eta\,[U(Q)-U(Q^{\prime})]=0,
\forall i,\ -\xi_{i}Q^{\prime}_{i}=0.

Since we need $Q'=Q$ to be a solution, we have

\forall i,\ -g^{\prime}(Q_{i})-\lambda-\eta f(P_{i})-\xi_{i}=0,
\forall i,\ -\xi_{i}Q_{i}=0.

For $Q_i \neq 0$, we have $\xi_i = 0$, so $Q_i = g^{\prime-1}[-\eta f(P_i) - \lambda]$; for $Q_i = 0$, we have $\xi_i > 0$, so $-\eta f(P_i) - \lambda > g^{\prime}(0)$. Denote $w = -\eta$ and $b = -\lambda$. Combining the two cases, we get:

Q_{i}=\hat{g}^{\prime-1}[w\cdot f(P_{i})+b],\quad w\leq 0,

where

\hat{g}^{\prime-1}(x)=\begin{cases}g^{\prime-1}(x)&\text{if }x<g^{\prime}(0),\\ 0&\text{if }x\geq g^{\prime}(0).\end{cases}

Now we have a necessary condition for $Q$ to be a Pareto-optimum. To make it sufficient, we further require that for any two distributions of this form, neither dominates the other. This property follows by combining Conclusions 2 and 3.

D.2 Conclusion 2

We separate the proof into two parts: (1) $b$ is uniquely determined by $w$; (2) the monotonicity of $b$ w.r.t. $w$.

(1) The sum of all $Q_i$ should be $1$. Denote

T(w,b)=\sum_{i=1}^{N}\hat{g}^{\prime-1}[w\cdot f(P_{i})+b].

Since $g'(x)$ is strictly monotonically decreasing, $T(w,b)$ is monotonically non-increasing w.r.t. $b$. If $T(w,b)>0$, at least one term is strictly monotonically decreasing w.r.t. $b$, in which case $T(w,b)$ is strictly monotonically decreasing w.r.t. $b$. Also, $T(w,b)$ is continuous w.r.t. $b$ since $g^{\prime-1}$ is continuous. When

b=g^{\prime}(0)-w\cdot f(\max_{i}P_{i}),

we have

w\cdot f(P_{i})+b\geq w\cdot f(\max_{i}P_{i})+b=g^{\prime}(0),

so $T(w,b)=0$; when

b=g^{\prime}(\frac{1}{N})-w\cdot f(\min_{i}P_{i}),

we have

w\cdot f(P_{i})+b\leq w\cdot f(\min_{i}P_{i})+b=g^{\prime}(\frac{1}{N}),

so $T(w,b)\geq 1$. From the above analysis, the value of $T$ can reach $0$ and can be at least $1$. Combining this with the monotonicity and continuity of $T$, there exists one and only one $b$ that satisfies $T(w,b)=1$, leading to a valid distribution.
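The argument above is constructive: given $w$, the unique $b$ can be located by bisection between the two values of $b$ analyzed above. The sketch below does this for the assumed illustrative choices $f(p)=p$ and $g(q)=q-q^{2}/2$ (chosen so that $g^{\prime}(0)=1$ is finite and the clipping in $\hat{g}^{\prime-1}$ actually matters); these are not claimed to be the metrics of the main text.

```python
import numpy as np

# Find the unique b with T(w, b) = 1 by bisection, using the bracket from the
# proof. Assumed illustrative choices: f(p) = p, g(q) = q - q**2 / 2, so
# g'(q) = 1 - q, g'(0) = 1 and g'^{-1}(y) = 1 - y.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(20))
f = lambda p: p
g_prime_0 = 1.0
g_prime_inv = lambda y: 1.0 - y
g_hat_prime_inv = lambda y: np.where(y < g_prime_0, g_prime_inv(y), 0.0)

def T(w, b):
    return float(np.sum(g_hat_prime_inv(w * f(P) + b)))

def solve_b(w, tol=1e-12):
    N = len(P)
    b_hi = g_prime_0 - w * f(P.max())        # T(w, b_hi) = 0
    b_lo = (1.0 - 1.0 / N) - w * f(P.min())  # g'(1/N) - w f(min P): T >= 1
    while b_hi - b_lo > tol:
        b_mid = 0.5 * (b_lo + b_hi)
        if T(w, b_mid) >= 1.0:               # T is non-increasing in b
            b_lo = b_mid
        else:
            b_hi = b_mid
    return 0.5 * (b_lo + b_hi)

w = -3.0
b = solve_b(w)
Q = g_hat_prime_inv(w * f(P) + b)
print(Q.sum())                               # ~1.0: a valid distribution
```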

(2) Define $T(w,b)=\sum_{i=1}^{N}\hat{g}^{\prime-1}[w\cdot f(P_{i})+b(w)]$ as above. Since $T(w,b)$ represents the total probability of a distribution, we have $T(w,b)\equiv 1$, thus $\frac{\mathrm{d}T}{\mathrm{d}w}=0$.

\frac{\mathrm{d}T}{\mathrm{d}w}=\sum_{i\in S}\frac{f(P_{i})+b^{\prime}(w)}{g^{\prime\prime}\{g^{\prime-1}[w\cdot f(P_{i})+b(w)]\}},

where $S=\{i \mid w\cdot f(P_{i})+b(w)<g^{\prime}(0)\}$. By the condition $\frac{\mathrm{d}T}{\mathrm{d}w}=0$, we get

b^{\prime}(w)=-\frac{\sum_{i\in S}\frac{f(P_{i})}{g^{\prime\prime}\{g^{\prime-1}[w\cdot f(P_{i})+b(w)]\}}}{\sum_{i\in S}\frac{1}{g^{\prime\prime}\{g^{\prime-1}[w\cdot f(P_{i})+b(w)]\}}}.

Since $g''(x)<0$, if $f(x)<0$ for all $x\in[0,1]$, we get $b'(w)>0$, thus $b$ is strictly monotonically increasing w.r.t. $w$. Similarly, if $f(x)>0$ for all $x\in[0,1]$, we get $b'(w)<0$, thus $b$ is strictly monotonically decreasing w.r.t. $w$.

D.3 Conclusion 3

We also separate the proof into two parts: (1) the uniqueness of $Q(w)$; (2) the monotonicity of $U$ and $V$ w.r.t. $w$.

(1) Since $P$ is not uniform, we can define $B$, $P_{m_1}$, $P_{m_2}$ as in the theorem. According to Lemma 1, since $P_{m_1}$ is the largest probability, the corresponding $Q_{m_1}$ is also the largest, which means

Q_{m_{1}}=\hat{g}^{\prime-1}[w\cdot f(P_{m_{1}})+b]>0.

Thus we get

w\cdot f(P_{m_{1}})+b<g^{\prime}(0).

At the same time, because $Q_i = Q_{m_1}$ whenever $P_i = P_{m_1}$, we can sum up all the largest $Q_i$ and get

M\cdot Q_{m_{1}}\leq\sum_{i=1}^{N}Q_{i}=1,

from which we get

w\cdot f(P_{m_{1}})+b\geq g^{\prime}(\frac{1}{M}). \quad (3)

Consider the case where $w \geq B$. We first prove that $w\cdot f(P_{m_2})+b\leq g^{\prime}(0)$. Assume

w\cdot f(P_{m_{2}})+b>g^{\prime}(0), \quad (4)

then $Q_{m_2}=0$, and $Q_i=0$ for any $i$ satisfying $P_i\leq P_{m_2}$. As a result, we must have $Q_i=\frac{1}{M}$ for all $i$ satisfying $P_i=P_{m_1}$, which means

w\cdot f(P_{m_{1}})+b=g^{\prime}(\frac{1}{M}). \quad (5)

Subtracting Equation 4 from Equation 5, we get

w\cdot[f(P_{m_{1}})-f(P_{m_{2}})]<g^{\prime}(\frac{1}{M})-g^{\prime}(0),

so

w<\frac{g^{\prime}(\frac{1}{M})-g^{\prime}(0)}{f(P_{m_{1}})-f(P_{m_{2}})}=B.

This contradicts the fact that $w\geq B$. Thus we have $w\cdot f(P_{m_2})+b\leq g^{\prime}(0)$.

Figure 4: Illustration of the Pareto-frontier on a random toy categorical distribution with size 20. The ground truth distribution is under the frontier curve. Left: Pair LL with NRR. Right: Pair CR with SE.

Combining the above conclusions, for any $w_1, w_2 \in [B, 0]$ with $w_1 \neq w_2$, assume $Q(w_1)=Q(w_2)$; then

w_{1}\cdot f(P_{m_{1}})+b_{1}=w_{2}\cdot f(P_{m_{1}})+b_{2},
w_{1}\cdot f(P_{m_{2}})+b_{1}=w_{2}\cdot f(P_{m_{2}})+b_{2}.

As $P_{m_1}\neq P_{m_2}$ (so $f(P_{m_1})\neq f(P_{m_2})$), we get $w_1=w_2$, a contradiction. Thus $Q(w_1)\neq Q(w_2)$.

For any $w\leq B$, assume

w\cdot f(P_{m_{2}})+b<g^{\prime}(0). \quad (6)

Subtracting Equation 6 from Equation 3, we get

w\cdot[f(P_{m_{1}})-f(P_{m_{2}})]>g^{\prime}(\frac{1}{M})-g^{\prime}(0),

so

w>\frac{g^{\prime}(\frac{1}{M})-g^{\prime}(0)}{f(P_{m_{1}})-f(P_{m_{2}})}=B.

This is a contradiction, so the above assumption does not hold. Thus we have $w\cdot f(P_{m_2})+b\geq g^{\prime}(0)$, which means $Q_{m_2}=0$. By the same argument as above, $Q_i=\frac{1}{M}$ for all $i$ satisfying $P_i=P_{m_1}$. This is the trivial Pareto-optimal case where $U(Q)=\max_{Q}U(Q)$. Now the distribution $Q$ is fixed and does not change as $w$ changes, so for any $w_1, w_2 \leq B$, we have $Q(w_1)=Q(w_2)$.

(2) From the expression of $Q_i$, since $f$ and $g'$ are both continuous and monotonic, it is easy to see that $Q_i$ is continuous w.r.t. $w$, and hence $U(Q(w))$ and $V(Q(w))$ are both continuous w.r.t. $w$. We only need to prove the monotonicity.

Assume $B \leq w_1 < w_2 \leq 0$. The goal is to prove that $U(Q(w_1))>U(Q(w_2))$ and $V(Q(w_1))<V(Q(w_2))$. According to Lemma 2, $w_1$ and $w_2$ have corresponding $\alpha_1=\frac{w_1}{w_1-1}$ and $\alpha_2=\frac{w_2}{w_2-1}$, with $\alpha_1>\alpha_2$. Since $Q(w)$ is the optimal solution of the problem $\max_Q\ \alpha U(Q)+(1-\alpha)V(Q)$, and $Q(w_1)$ is different from $Q(w_2)$, the following inequalities hold:

\alpha_{1}U(Q(w_{1}))+(1-\alpha_{1})V(Q(w_{1}))>\alpha_{1}U(Q(w_{2}))+(1-\alpha_{1})V(Q(w_{2})),
\alpha_{2}U(Q(w_{1}))+(1-\alpha_{2})V(Q(w_{1}))<\alpha_{2}U(Q(w_{2}))+(1-\alpha_{2})V(Q(w_{2})).

Subtracting the second inequality from the first, we get

\big[(U(Q(w_{1}))-U(Q(w_{2})))-(V(Q(w_{1}))-V(Q(w_{2})))\big]\cdot(\alpha_{1}-\alpha_{2})>0.

As $\alpha_1>\alpha_2$, we have

U(Q(w_{1}))-U(Q(w_{2}))>V(Q(w_{1}))-V(Q(w_{2})).

Because $Q(w_1)$ and $Q(w_2)$ are both Pareto-optima, their quality and diversity must satisfy one of the following: $U(Q(w_1))>U(Q(w_2)),\ V(Q(w_1))<V(Q(w_2))$ or $U(Q(w_1))<U(Q(w_2)),\ V(Q(w_1))>V(Q(w_2))$. With the derived restriction $U(Q(w_1))-U(Q(w_2))>V(Q(w_1))-V(Q(w_2))$, we know the first case holds, that is, $U(Q(w_1))>U(Q(w_2))$ and $V(Q(w_1))<V(Q(w_2))$.

Figure 5: The optimization curve of Quality-Discrepancy for the BLEU-NSBLEU metric pair on synthetic data with different standard deviations, $\sigma=0.5, 1.0, 2.0$ from left to right.

Figure 6: Evaluation of BLEU-NSBLEU, CR-NRR, and CND on synthetic data with $\sigma=1.0$. Test models are Pareto-optima parameterized by $\beta$ under the LL-SE metric pair.

Appendix E Proof of Theorem 3

The requirement that $Q=P$ be a Pareto-optimum is equivalent to the following condition: for any $P$, there exist $w_0 \leq 0$ and $b_0$ such that for any $i$,

P_{i}=\hat{g}^{\prime-1}[w_{0}\cdot f(P_{i})+b_{0}].

This means that for any $P_i>0$, we have $w_0\cdot f(P_i)+b_0=g^{\prime}(P_i)$. Since $f$ and $g'$ are both continuous,

w_{0}\cdot f(0)+b_{0}-g^{\prime}(0)=\lim_{P_{i}\to 0}\big[w_{0}\cdot f(P_{i})+b_{0}-g^{\prime}(P_{i})\big]=0.

We can see that $w_0\cdot f(P_i)+b_0=g^{\prime}(P_i)$ also holds for $P_i=0$. Solving this differential equation, we get

g(x)=w_{0}\int_{0}^{x}f(u)\,\mathrm{d}u+b_{0}x.

Here $b_0$ can be any value, because $P_i=g^{\prime-1}[w_0\cdot f(P_i)+b_0]$ always leads to a valid distribution $P$. Under this condition, we know that $Q=P$ is the only distribution that maximizes $\Psi(Q)=\alpha U(Q)+(1-\alpha)V(Q)$ with $\alpha=\frac{w_0}{w_0-1}$, according to Lemma 2. With the above conclusions, it is easy to check that $D(P\|Q)=\Psi(P)-\Psi(Q)\geq 0$ and $D(P\|Q)=0$ if and only if $Q=P$; thus $D(P\|Q)$ is a divergence metric.
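As a concrete instance of this construction, assume for illustration $f(p)=\log p$, $w_0=-1$ and $b_0=0$. The matching condition then gives $g(x)=-\int_{0}^{x}\log u\,\mathrm{d}u=x-x\log x$ (strictly concave), $\alpha=\frac{w_0}{w_0-1}=\frac{1}{2}$, and $D(P\|Q)=\Psi(P)-\Psi(Q)$ works out to $\frac{1}{2}\mathrm{KL}(Q\|P)$, which the sketch below checks numerically.

```python
import numpy as np

# Check of Theorem 3 for one matched pair (an illustrative assumption):
# f(p) = log p, w_0 = -1, b_0 = 0, hence g(x) = x - x log x and alpha = 1/2.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(10))

def Psi(Q):
    U = np.sum(Q * np.log(P))            # quality with f = log
    V = np.sum(Q - Q * np.log(Q))        # diversity with g(x) = x - x log x
    return 0.5 * U + 0.5 * V             # alpha = 1/2

D = lambda Q: Psi(P) - Psi(Q)

assert np.isclose(D(P), 0.0)             # D vanishes at Q = P
for _ in range(1000):
    Q = rng.dirichlet(np.ones(10))
    kl_QP = np.sum(Q * np.log(Q / P))
    assert D(Q) >= -1e-12                # non-negative everywhere
    assert np.isclose(D(Q), 0.5 * kl_QP) # closed-form equivalence
```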

Appendix F Pareto-frontier with Mismatched Metrics

We show in Figure 4 that the point $Q=P$ lies under the Pareto-frontier curve when the quality and diversity metrics are not matched, i.e. the condition in Theorem 3 is not satisfied. We use the same toy dataset, but pair LL with NRR and CR with SE. Note that there is always a gap between the star and the curve, indicating that the real distribution lies on neither of the two Pareto-frontiers.

Appendix G Additional Information for Experiments

G.1 Experiments on Synthetic Data

The probabilities of the synthetic ground-truth distributions are shown in Figure 7. We use different standard deviations to get different kinds of distributions. The distribution with $\sigma=0.5$ is flatter and has higher entropy, while the distribution with $\sigma=2.0$ is sharper and has lower entropy.
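The exact sampling procedure is not restated here; one plausible construction (an assumption for illustration only) is to draw log-weights from a zero-mean Gaussian with standard deviation $\sigma$ and normalize with a softmax, which reproduces the qualitative behavior described above:

```python
import numpy as np

# Hypothetical construction of a synthetic ground-truth distribution: draw
# log-weights from N(0, sigma^2) and normalize. Larger sigma gives a sharper,
# lower-entropy distribution. This is an assumption, not the exact recipe
# used for Figure 7.
def make_synthetic_distribution(size, sigma, seed=0):
    rng = np.random.default_rng(seed)
    logits = rng.normal(0.0, sigma, size)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

for sigma in (0.5, 1.0, 2.0):
    P = make_synthetic_distribution(1000, sigma)
    print(f"sigma={sigma}: entropy={-np.sum(P * np.log(P)):.3f}")
```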

We show the training curves of the optimization process on synthetic data in Figure 5. Learning rates are adjusted for each process so as to find the best distribution. Points are discarded if $V(Q)\leq V(P)$ or $U(Q)\leq U(P)$, i.e. if they fail to dominate the ground-truth distribution.

We show the correlation between CR/NRR/CND and quality/diversity/divergence on synthetic data, respectively. We use the well-defined Pareto-frontier under LL-SE in text space as target models, i.e. $Q_i \propto P_i^{\beta}$. As $\beta$ decreases, the corresponding Pareto-optimum becomes closer to the uniform distribution, so quality decreases and diversity increases according to Theorem 2, and the minimal divergence is attained when $\beta=1$ according to Theorem 3. We plot the curves of BLEU-NSBLEU, CR-NRR, and CND in Figure 6. We can see that CR/NRR/CND properly reflect quality/diversity/divergence, respectively.
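The sketch below reproduces this setup on a toy distribution: it builds the $\beta$-parameterized Pareto-optima $Q_i\propto P_i^{\beta}$ and evaluates them with the LL/SE instances $U(Q)=\sum_i Q_i\log P_i$ and $V(Q)=-\sum_i Q_i\log Q_i$, together with the matched divergence taken here (as in the Appendix E example, an illustrative assumption) to be $\frac{1}{2}\mathrm{KL}(Q\|P)$.

```python
import numpy as np

# Target-model family Q ∝ P^beta: quality increases and diversity decreases
# as beta grows, and the (assumed) matched divergence 0.5 * KL(Q||P) is
# minimal at beta = 1, in line with Theorems 2 and 3.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(50))

def pareto_optimum(beta):
    Q = P ** beta
    return Q / Q.sum()

betas = np.linspace(0.2, 3.0, 15)                       # includes beta = 1
Qs = [pareto_optimum(b) for b in betas]
U = np.array([np.sum(Q * np.log(P)) for Q in Qs])       # LL quality
V = np.array([-np.sum(Q * np.log(Q)) for Q in Qs])      # SE diversity
D = np.array([0.5 * np.sum(Q * np.log(Q / P)) for Q in Qs])

assert np.all(np.diff(U) > 0)                # quality grows with beta
assert np.all(np.diff(V) < 0)                # diversity shrinks with beta
assert np.isclose(betas[np.argmin(D)], 1.0)  # divergence minimal at beta = 1
```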

G.2 Experiments on Real Text Data

Figure 7: The log-probabilities of the three synthetic ground-truth distributions used in our experiments, shown in descending order.

For the MSCOCO dataset, we remove words with frequency lower than 20, as well as sentences containing them. The vocabulary size is 5,473, and the maximum text length is 32. Sentences longer than 32 are also removed, leaving a total of 530,093 sentences. We randomly sample 50,000 sentences as the candidate set, 50,000 sentences as the reference set, and another 200,000 sentences as training data for the RNNLM.

For the WMT dataset, we use the Europarl-v7 part. We remove words with frequency lower than 400, as well as sentences containing them. The vocabulary size is 6,655, and the maximum text length is 50. Sentences longer than 50 or shorter than 20 are also removed, leaving a total of 475,662 sentences. We again randomly sample 50,000 sentences as the candidate set, 50,000 sentences as the reference set, and another 200,000 sentences as training data for the RNNLM.
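A minimal preprocessing sketch matching the two descriptions above; the function name, whitespace tokenization, and the ordering of the filtering steps are assumptions for illustration, not the exact pipeline we used.

```python
from collections import Counter

# Keep only sentences whose words all meet the frequency threshold and whose
# lengths fall in the allowed range; the helper below is hypothetical.
def filter_corpus(sentences, min_freq, max_len, min_len=1):
    counts = Counter(w for s in sentences for w in s.split())
    vocab = {w for w, c in counts.items() if c >= min_freq}
    kept = [s for s in sentences
            if min_len <= len(s.split()) <= max_len
            and all(w in vocab for w in s.split())]
    return vocab, kept

# MSCOCO-style filtering: filter_corpus(sents, min_freq=20, max_len=32)
# WMT Europarl-style:     filter_corpus(sents, min_freq=400, max_len=50, min_len=20)
```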