
Generating Long Financial Report using Conditional Variational Autoencoders with Knowledge Distillation

Yunpeng Ren1,2, Ziao Wang1, Yiyuan Wang1, Xiaofeng Zhang1
Abstract

Automatically generating a financial report from a piece of news is quite a challenging task. The difficulty lies in the lack of sufficient background knowledge needed to generate a long financial report. To address this issue, this paper proposes a conditional variational autoencoder (CVAE) based approach which distills external knowledge from a corpus of news-report data. Particularly, we choose Bi-GRU as the encoder and decoder component of the CVAE and learn the latent variable distribution from the input news. A higher-level latent variable distribution, learnt from a corpus of news-report data extracted for each input news, provides background knowledge to the previously learnt latent variable distribution. Then, a teacher-student network is employed to distill knowledge that refines the output of the decoder component. To evaluate the proposed approach, extensive experiments are performed on a public dataset, and two widely adopted evaluation criteria, i.e., BLEU and ROUGE, are chosen. The promising experimental results demonstrate that the proposed approach is superior to the compared methods.

Introduction

Text generation (Liao et al. 2018; Duan et al. 2019) has long been investigated in the domain of natural language processing with flourishing results (Feng et al. 2018; Serban et al. 2017; Mager et al. 2020; Hossain, Ghazvininejad, and Zettlemoyer 2020; Wang et al. 2019). Generally, the text generation task comprises several sub-problems, e.g., long text generation (Guo et al. 2017; Bahdanau, Cho, and Bengio 2014; Shen et al. 2019; Dai et al. 2019) and summary generation (Gao, Zhao, and Eger 2020; Genest and Lapalme 2011; Liu and Lapata 2019; Cao et al. 2018). Among these sub-problems, long text generation from short text is quite challenging, especially in a domain-specific task such as financial report generation. Particularly, the difficulty of generating financial reports from a piece of short news lies in the lack of sufficient information, especially financial background knowledge, and thus this challenging task is seldom addressed by existing work.

Prior works. In the literature, there exist a good number of recurrent neural network (RNN) based text generation approaches (Feng et al. 2018; Sorodoc, Gulordava, and Boleda 2020; Xu et al. 2020a), although few target long text generation (Guo et al. 2017). Generally, these approaches can be classified into two categories, i.e., deep generative model based approaches and generative adversarial network (GAN) based approaches. As for the generative model based approaches, the variational autoencoder (VAE) is widely adopted by many researchers for this task (Kingma and Welling 2013; McCarthy et al. 2020; Wang and Wan 2019). In these VAE based approaches, an LSTM or GRU-like component is usually chosen as the encoder or decoder. Also, a conditional variational autoencoder (Yang et al. 2018) with a hybrid decoder that adds deconvolutional neural networks was proposed to learn topics for generating Chinese poems. Particularly, (Shen et al. 2019) designed a hierarchy of stochastic layers between the encoder and decoder components to learn a VAE model for generating long coherent text. For GAN based approaches, a good number of research attempts (Yang et al. 2019a; Guo et al. 2017; Zhang et al. 2020) have been made towards generating high-quality long text. To incorporate background knowledge, (Yang et al. 2019a) extracts knowledge from an external knowledge base and integrates it with the generator through a dynamic memory mechanism, and the model is adversarially trained with a multi-classifier as the discriminator.

However, there exist two challenges which invalidate most existing approaches. First, the input news is rather short, e.g., “The coronavirus recession struck swiftly and violently. Now, with the US economy still in the grip of the outbreak five months later, the recovery looks fitful and uneven and painfully slow”. But the length of the generated reports is usually much greater than that of the input news, and it is thus quite challenging to generate reasonable reports. Second, the generation of financial reports by human specialists often involves intellectual effort, such as inferring and reasoning abilities. Till now, these challenges remain unresolved and call for further research efforts.

Figure 1: An illustrating example of the proposed approach.

To address the aforementioned issues, we propose a novel conditional variational autoencoder based approach with knowledge distillation, called CVAE-KD. An illustrating example of the proposed approach is plotted in Figure 1. In the proposed approach, a carefully designed teacher-student network structure is integrated with the CVAE model to simultaneously resolve the information loss and knowledge reasoning issues. To encode the input news, a Bi-GRU component is adopted and the latent variable distribution is learnt on each batch of input news. Obviously, the learnt latent variable z_1 does not contain sufficient information to decode a high-quality report. To provide sufficient information as well as to simulate human inference knowledge, a higher-level latent variable distribution z_2 is learnt from a corpus of historical news-report data. To embed these data, a pre-trained ELMO component (Peters et al. 2018) is adopted, and z_2 is then estimated using the feature embeddings of the aggregated similar news. During model learning, z_1 is forced to approximate z_2 by a designed KL-divergence term. For the decoder component, a GRU is adopted. To further refine the output reports, the financial reports of similar news are embedded through the same ELMO component and are treated as the teacher, while the output of the decoder component is treated as the student. With this knowledge distillation step, the corresponding financial report for a piece of input news is generated.

The main contributions of this paper are summarized as follows.

  • We propose a conditional variational autoencoder (CVAE) based approach with knowledge distillation to generate financial reports. To the best of our knowledge, this is among the first attempts to resolve this challenging task.

  • We employ a pre-trained model as the teacher network so that the student component learns background knowledge from an external knowledge base. A corresponding KL-divergence loss is designed for this purpose.

  • Extensive experiments are performed on a real-world dataset. The promising experimental results demonstrate the superiority of the proposed approach over the compared baseline and state-of-the-art approaches.

The Proposed Approach

Figure 2: Details of the proposed CVAE-KD.

Preliminaries and Problem Setup

Let X denote a corpus of news data with each x = (x_1, ..., x_M) containing M tokens, and let Y denote the set of generated reports with each y = (y_1, ..., y_N) containing N tokens, where N is significantly greater than M. The financial report generation problem can be formulated as

p(y_1, y_2, \dots, y_N) = \prod_{k=1}^{N} p(y_k \mid y_{1:k-1}; x).

That is, given a piece of news x, together with the k-1 generated tokens y_{1:k-1}, we predict the probability of the next token to be generated.

The proposed CVAE-KD

The framework of the proposed CVAE-KD model is plotted in Figure 2. In this approach, the input news X is first embedded using a lookup table. Then, a Bi-GRU is adopted to encode each input x, and the latent variable distribution is learnt. For background knowledge learning, a set of similar news, as well as the corresponding reports, is extracted to estimate a higher-level latent variable distribution. At last, a GRU component is employed to decode the output report. The proposed CVAE-KD has three components, i.e., the encoder component, the background knowledge extraction component, and the decoder component, which are detailed in the following subsections.

Encoder Component

We adopt a Bi-GRU for the encoder component, whose input is a piece of news x. First, x is embedded using a lookup table and then fed into the Bi-GRU. As usual, we only choose the hidden state h_t^e of the Bi-GRU as the output of this component. The general calculations of the Bi-GRU are given as

c_t = \sigma(W_c x + U_c h_{t-1}^{ex})
r_t = \sigma(W_r x + U_r h_{t-1}^{ex})
h_t^{x'} = \tanh(W x + U(r_t h_{t-1}^{ex}))
h_t^{xe} = (1 - c_t) h_{t-1}^{ex} + c_t h_t^{x'}
h^e = [\overrightarrow{h_t^{xe}}, \overleftarrow{h_t^{xe}}],

where c_t is the update gate, r_t is the reset gate, h_{t-1}^{ex} is the previous activation, h_t^{x'} is the candidate activation, h_t^{xe} is the current activation, and h^e is the concatenation of the activations from both directions.

After embedding the news data into feature vectors, we assume the latent variables z of the CVAE follow a Gaussian distribution. Then, two MLPs are respectively applied to learn the parameters of the latent variable distribution, i.e., the mean μ and the standard deviation σ, calculated as

\mu_x = f_{\mu_x}(h_t^e)
\sigma_x = f_{\sigma_x}(h_t^e).

With the learnt latent variable distribution, we can then sample the latent variable z_1 from it, written as

z_1^i = \mu_x^i + \sigma_x^i \epsilon, \quad \epsilon \sim N(0, I)/N(\mu, \sigma).
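For concreteness, the encoder described above can be sketched in PyTorch as follows. The class and parameter names (NewsEncoder, latent_dim, etc.) are illustrative assumptions rather than the authors' released code, and predicting the log-variance instead of σ_x directly is a common stability choice rather than something the paper specifies.

```python
# A minimal sketch (under the stated assumptions) of the Bi-GRU encoder that
# embeds a piece of news, predicts the Gaussian parameters with two MLPs,
# and samples z_1 via the reparameterization trick.
import torch
import torch.nn as nn

class NewsEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=256, latent_dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)          # lookup table
        self.bigru = nn.GRU(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)                    # Bi-GRU encoder
        self.mu_head = nn.Linear(2 * hidden_dim, latent_dim)       # f_{mu_x}
        self.logvar_head = nn.Linear(2 * hidden_dim, latent_dim)   # f_{sigma_x} (log-variance form)

    def forward(self, x):
        # x: (batch, M) token ids of the input news
        emb = self.embed(x)
        _, h_n = self.bigru(emb)                   # h_n: (2, batch, hidden_dim)
        h_e = torch.cat([h_n[0], h_n[1]], dim=-1)  # concatenate forward/backward states
        mu, logvar = self.mu_head(h_e), self.logvar_head(h_e)
        std = torch.exp(0.5 * logvar)
        z1 = mu + std * torch.randn_like(std)      # z_1 = mu_x + sigma_x * eps, eps ~ N(0, I)
        return z1, mu, logvar
```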

Background Knowledge Extraction and Distillation

Learning the latent variable from background knowledge

In this component, we first build an external knowledge base for each input x. To this end, we apply the standard KNN algorithm to group a subset X_s of news that are most similar to x. Then, we extract the set Y_s of the corresponding financial reports.

Similar to the previous steps, we first look up X_s and Y_s and then embed them using the Bi-GRU component. Then, another Gaussian distribution is assumed for this background knowledge base, with its parameters estimated as

\mu_{X,Y} = f_{\mu_x}([X_{em}; Y_{em}])
\sigma_{X,Y} = f_{\sigma_x}([X_{em}; Y_{em}])
z_2^i = \mu_{X,Y}^i + \sigma_{X,Y}^i \epsilon_{X,Y}, \quad \epsilon_{X,Y} \sim N(0, I)/N(\mu, \sigma),

where X_k and Y_k are the globally learnt feature representations, X_em and Y_em are the feature representations of each input pair of data (news-report), and z_2 is the latent variable sampled from the learnt Gaussian distribution.
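A hedged sketch of this background-knowledge step is given below: a standard KNN index retrieves the most similar news for each input, and a small prior network maps the pooled news-report embeddings [X_em; Y_em] to the parameters of the higher-level latent distribution. The function names, the choice of k, and the use of scikit-learn for the retrieval are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch: KNN retrieval of similar news plus the prior network
# that estimates N(mu_{X,Y}, sigma_{X,Y}) and samples z_2.
import torch
import torch.nn as nn
from sklearn.neighbors import NearestNeighbors

def build_neighbor_index(news_vectors, k=5):
    """news_vectors: (num_news, dim) array of pre-computed news embeddings."""
    return NearestNeighbors(n_neighbors=k).fit(news_vectors)

class PriorNetwork(nn.Module):
    """Maps the concatenated [X_em; Y_em] features to the parameters of z_2."""
    def __init__(self, feat_dim, latent_dim=16):
        super().__init__()
        self.mu_head = nn.Linear(2 * feat_dim, latent_dim)
        self.logvar_head = nn.Linear(2 * feat_dim, latent_dim)

    def forward(self, x_em, y_em):
        h = torch.cat([x_em, y_em], dim=-1)                     # [X_em; Y_em]
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        z2 = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z2, mu, logvar

# Usage sketch: indices of the k most similar news for one query embedding
# neighbors = build_neighbor_index(all_news_vecs).kneighbors(
#     query_vec.reshape(1, -1), return_distance=False)
```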

Knowledge distillation to supervise report generation

To distill knowledge from the previously extracted external knowledge base, the extracted subsets X_s and Y_s are first embedded using the pre-trained ELMO model, whose outputs are denoted as ELMO_k^{task}. Without loss of generality, we concatenate these ELMO embeddings with the original feature representations, and thus the output of this module can be written as

X_{em} = [X_k; ELMO_k^{task}]
Y_{em} = [Y_k; ELMO_k^{task}].

To align the output of this component with the vocabulary, a two-layer MLP is adopted. Then, Y_em is considered as the teacher to supervise the generation of the financial report y given x.

Accordingly, the student is the output of the decoder component during the model training process. Given the output of the decoder, i.e., y = (y_1, y_2, ..., y_N), the employed ELMO predicts the probability of the next token to be generated given a sequence of generated tokens in y, written as

p(y_1, y_2, \dots, y_N) = \prod_{k=1}^{N} p(y_k \mid y_1, y_2, \dots, y_{k-1})
p(y_1, y_2, \dots, y_N) = \prod_{k=1}^{N} p(y_k \mid y_{k+1}, y_{k+2}, \dots, y_N).

The knowledge distillation is optimized according to the following objective function:

L_{kd}(\theta) = -\sum_{\omega \in V} \big[ P_{\phi}(y_t = \omega \mid x, y) \log P_{\theta}(y_t = \omega \mid x, y_{1:t-1}) \big],

where P_ϕ(y_t) is the soft target of ELMO, ϕ denotes the parameters of ELMO, and V denotes the output vocabulary.
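A minimal sketch of this distillation objective is shown below: the teacher's soft target distribution (from the ELMO side) supervises the student decoder through a soft cross-entropy. The tensor shapes and function names are illustrative assumptions.

```python
# Soft-target cross-entropy corresponding to L_kd above:
# -sum_w P_phi(y_t = w | x, y) * log P_theta(y_t = w | x, y_{1:t-1}),
# averaged over time steps and the batch.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits):
    # student_logits, teacher_logits: (batch, seq_len, vocab_size)
    teacher_probs = F.softmax(teacher_logits, dim=-1)           # soft targets P_phi
    student_log_probs = F.log_softmax(student_logits, dim=-1)   # log P_theta
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```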

Decoder Component

To generate the report y, a GRU is chosen as the decoder component. The output of the GRU is used as the output of this component, given as

h_t^s = GRU(z_1).

This output is then fed into an MLP to generate the probability of each word in the adopted vocabulary, given as

p(y_k) = \mathrm{softmax}(\tanh(W_m h_t^s + b)).

The word with the maximum probability value is chosen as the final output at each iteration.
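The decoder can be sketched as below, where the sampled latent variable z_1 initializes the GRU state and the tanh-softmax projection scores the vocabulary at each step; greedy (argmax) selection follows the description above. The names and the z_1-to-initial-state mapping are illustrative assumptions.

```python
# Sketch of the GRU decoder with the projection p(y_k) = softmax(tanh(W_m h_t^s + b)).
import torch
import torch.nn as nn

class ReportDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=256, latent_dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.z_to_h = nn.Linear(latent_dim, hidden_dim)   # map z_1 to the initial GRU state
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)     # W_m h_t^s + b

    def forward(self, z1, prev_tokens):
        # prev_tokens: (batch, t) tokens generated so far (teacher-forced during training)
        h0 = torch.tanh(self.z_to_h(z1)).unsqueeze(0)     # (1, batch, hidden_dim)
        out, _ = self.gru(self.embed(prev_tokens), h0)    # h_t^s at every position
        probs = torch.softmax(torch.tanh(self.proj(out)), dim=-1)
        return probs                                      # argmax over the last dim gives the next word
```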

Model Loss

The overall loss function of the proposed CVAE-KD contains two terms, i.e., the CVAE loss and the knowledge distillation loss, which are described as follows.

The CVAE loss contains two parts: the first calculates the reconstruction loss of the model, and the second forces the latent variable z_1, learnt from the input news x, to approximate the latent variable z_2, learnt from the background knowledge base. It is written as

L_{CVAE}(x, y; \theta, \phi) = -D_{KL}\big[ q_{\varphi}(z_2 \mid X, Y) \,\|\, p_{\theta}(z_1 \mid x) \big] + E_{q_{\phi}(z_1 \mid x, z_2)}\big[ \log p_{\theta}(y \mid z_1, x) \big],

where p_θ(z_1|x), q_ϕ(z_1|x, z_2) and q_φ(z_2|X, Y) are respectively calculated as

q_{\phi}(z_1 \mid x, z_2) = N(z_1; \mu_x, \mu_{z_2}, \sigma_x, \sigma_{z_2})
p_{\theta}(z_1 \mid x) = N(z_1; \mu_x, \sigma_x)
q_{\varphi}(z_2 \mid X, Y) = N(z_2; \mu_{X,Y}, \sigma_{X,Y}).

With the knowledge distillation loss L_kd defined above, the overall model loss of the proposed CVAE-KD can be written as

L_{total} = \alpha L_{CVAE} + (1 - \alpha) L_{kd},

where α is a learnable parameter. The model is then optimized using the Adam algorithm (Kingma and Ba 2014).
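A hedged sketch of the overall objective follows: the reconstruction term, the KL term that pulls z_1 toward z_2, and the distillation term are combined with the weight α. Treating α as a sigmoid-squashed scalar parameter, and using the negative ELBO as the CVAE loss to be minimized, are assumptions made for illustration.

```python
# Sketch of L_total = alpha * L_CVAE + (1 - alpha) * L_kd for diagonal Gaussians.
import torch
import torch.nn.functional as F

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, sigma_q) || N(mu_p, sigma_p) ), summed over latent dimensions."""
    return 0.5 * (logvar_p - logvar_q
                  + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                  - 1.0).sum(dim=-1).mean()

def total_loss(recon_logits, target_ids, mu1, logvar1, mu2, logvar2, l_kd, alpha):
    # recon_logits: (batch, seq_len, vocab); target_ids: (batch, seq_len)
    recon = F.cross_entropy(recon_logits.transpose(1, 2), target_ids)  # -E[log p(y | z_1, x)]
    kl = gaussian_kl(mu2, logvar2, mu1, logvar1)  # D_KL[q(z_2 | X, Y) || p(z_1 | x)]
    l_cvae = recon + kl
    a = torch.sigmoid(alpha)                      # keep the learnable weight in (0, 1)
    return a * l_cvae + (1.0 - a) * l_kd
```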

Methods BLEU-1 BLEU-2 BLEU-3  BLEU-4
Seq2seq 32.69 7.65 4.85  2.75
Seq2seq+Attn 33.64 13.85 9.89  6.92
Pointer-Generator 36.45 9.51 5.75  2.45
CVAE 33.5 14.07 10.04  6.97
CVAE-KD 46.67 20.32 12.81 8.00
Table 1: Evaluation results of all compared methods with respect to BLEU criteria.
Methods ROUGE-1 ROUGE-2  ROUGE-L
Seq2seq 8.46 1.30    3.59
Seq2seq+Attn 15.66 3.02    3.89
Pointer-Generator 12.08 1.40    3.44
CVAE 16.69 3.32    4.65
CVAE-KD 18.27 2.64    6.95
Table 2: Evaluation results of all compared methods with respect to ROUGE criteria.

Experiments

In this section, we first describe how the dataset is prepared as well as the evaluation criteria. Then, several baseline models as well as the SOTA approaches are introduced. At last, we perform extensive experiments on one real-world dataset to answer the following research questions:

  • RQ1: Does the proposed approach outperform the state-of-the-art approaches for long text generation, i.e., financial reports, given a piece of short news?

  • RQ2: What is the model performance when generating reports of different sizes?

  • RQ3: How is the quality of generated financial reports?

  • RQ4: Does the proposed knowledge distillation component affect model performance (ablation study)?

Dataset and Evaluation Criteria

Dataset Preparation

To evaluate model performance, we crawled financial news as well as the corresponding reports from three well-known Chinese financial websites: Sina Finance (http://stock.finance.sina.com.cn/stock/), Tonghuashun Finance (http://m.10jqka.com.cn/ybnews/) and Eastmoney (http://data.eastmoney.com/report/). The raw dataset contains 10,706 pairs of news-report data, and each piece of news is associated with a financial report.

Data Pre-processing

An open-source tool, “jieba” (https://github.com/fxsjy/jieba), is first adopted to segment the Chinese news. After word segmentation, the average lengths of the financial news and the financial reports are 28 and 331 words, respectively. Then, we filter the vocabulary by term frequency (TF), keeping words whose TF is higher than 5, and the resulting vocabulary contains 17,210 words. At last, we replace the numeric symbols with a number token (NUM). Furthermore, the vocabulary is augmented with four other tokens, i.e., the padding token (PAD), unknown token (UNK), start position token (START) and end position token (END).
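The pre-processing described above could be sketched as follows; the regular expression for numeric symbols, the helper names, and the exact filtering rule (keeping words whose TF is above 5) are assumptions made for illustration.

```python
# Sketch of the pre-processing pipeline: jieba segmentation, NUM replacement,
# frequency-based vocabulary filtering, and the four special tokens.
import re
from collections import Counter
import jieba

SPECIALS = ["PAD", "UNK", "START", "END", "NUM"]

def tokenize(text):
    text = re.sub(r"\d+(\.\d+)?", " NUM ", text)     # replace numeric symbols with NUM
    return [tok for tok in jieba.lcut(text) if tok.strip()]

def build_vocab(corpus, min_tf=5):
    counts = Counter(tok for doc in corpus for tok in tokenize(doc))
    kept = sorted(w for w, c in counts.items() if c > min_tf and w not in SPECIALS)
    return {w: i for i, w in enumerate(SPECIALS + kept)}
```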

Evaluation criteria

The Bilingual Evaluation Understudy (BLEU) and ROUGE are chosen as the evaluation criteria. As most Chinese phrases consist of fewer than five words, we chose the BLEU-1, BLEU-2, BLEU-3 and BLEU-4 scores as the detailed evaluation measurements. Similarly, we chose the ROUGE-1, ROUGE-2 and ROUGE-L scores as the ROUGE-type evaluation criteria.
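For reference, BLEU-1 through BLEU-4 could be computed with NLTK as sketched below, and ROUGE-1 is shown as a simple unigram-overlap F1; this is an illustrative approximation, not the exact evaluation script used in the experiments.

```python
# Illustrative BLEU-n and ROUGE-1 computation on tokenized reference/candidate reports.
from collections import Counter
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_scores(reference_tokens, candidate_tokens):
    smooth = SmoothingFunction().method1
    weights = {"BLEU-1": (1, 0, 0, 0), "BLEU-2": (0.5, 0.5, 0, 0),
               "BLEU-3": (1 / 3, 1 / 3, 1 / 3, 0), "BLEU-4": (0.25, 0.25, 0.25, 0.25)}
    return {name: sentence_bleu([reference_tokens], candidate_tokens,
                                weights=w, smoothing_function=smooth)
            for name, w in weights.items()}

def rouge_1(reference_tokens, candidate_tokens):
    overlap = sum((Counter(reference_tokens) & Counter(candidate_tokens)).values())
    if overlap == 0:
        return 0.0
    recall = overlap / len(reference_tokens)
    precision = overlap / len(candidate_tokens)
    return 2 * precision * recall / (precision + recall)
```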

Baseline Models

To evaluate the model performance, several baseline and state-of-the-art approaches, i.e., Seq2Seq, Seq2Seq+Attn, the Pointer-Generator network and the CVAE model, are chosen in the experiments for performance comparison. Details of each compared approach are as follows.

  • Seq2Seq (Sutskever, Vinyals, and Le 2014) is considered a baseline model for the text generation task and already achieves superior performance on various text-to-text generation problems.

  • Seq2Seq+Attn (Bahdanau, Cho, and Bengio 2014) extends the original Seq2Seq by allowing a soft search for relevant words from the input source sentences when predicting the next word to be generated.

  • Pointer-Generator network (See, Liu, and Manning 2017) is a state-of-the-art model that samples words from the input source sentences via a pointing process and integrates a coverage mechanism to penalize the generation of repetitive words.

  • CVAE (Zhao, Zhao, and Eskenazi 2017) is most related to our proposed approach. This model captures discourse-level diversity in the encoder and uses greedy decoders to generate diverse responses, and the latent variables are used to learn the distribution over potential conversational intentions.

Experimental Settings

Network Parameters

The parameters of the designed neural network model are set as follows. The number of hidden units in the employed GRU components is set to 256, the dimension of the hidden (latent) variables is set to 16, and the dropout rate of the decoder is set to 0.5 to avoid over-fitting. The learning rate is set to 0.001, the batch size is set to 16, and the weight of the knowledge distillation loss is set to 1.

Experimental Settings

In the experiments, the maximum input length of the encoder is set to 30. To generate reports of different lengths, the length of the decoder is respectively set to 100, 150 and 200 to evaluate the performance of the proposed CVAE-KD. We then evaluate all models and report the experimental results in the following subsections.
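The hyper-parameters listed in this section can be collected into a single configuration object, as sketched below; the field names are illustrative.

```python
# Illustrative configuration mirroring the experimental settings above.
from dataclasses import dataclass

@dataclass
class CVAEKDConfig:
    gru_hidden_units: int = 256        # hidden units of the GRU components
    latent_dim: int = 16               # dimension of the hidden (latent) variables
    decoder_dropout: float = 0.5       # dropout rate of the decoder
    learning_rate: float = 1e-3
    batch_size: int = 16
    kd_loss_weight: float = 1.0        # weight of the knowledge distillation loss
    max_encoder_length: int = 30       # maximum input news length
    decoder_lengths: tuple = (100, 150, 200)  # evaluated report lengths
```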

Results on generating financial reports (RQ1)

In this experiment, we evaluate all approaches and the corresponding BLEU and ROUGE scores are reported in Table 1 and Table 2, respectively.

From Table 1, it is obvious that the proposed CVAE-KD is the best model. The BLEU-1, BLEU-2, BLEU-3 and BLEU-4 scores of CVAE-KD are 28.0%, 44.4%, 27.6% and 14.8% higher than the second best scores, respectively. Except for the BLEU-1 indicator, the CVAE is the second best model, which partially verifies the effectiveness of the proposed CVAE-KD over the CVAE model. It is also noticed that the Pointer-Generator (PG) network is the worst model. The possible reason is that the PG network is designed for the summarization task from long text to short text, and it might not suit our problem. We also observe that when longer n-grams are evaluated, e.g., BLEU-4, the performance of all approaches decreases. This is consistent with the common sense that it is more challenging to generate a longer phrase.

From the evaluation results in Table 2, similar observations can be made. First, the proposed CVAE-KD achieves the best results on all criteria except the ROUGE-2 metric. The ROUGE-1 and ROUGE-L scores of our model are respectively 9.5% and 49.5% higher than those of the CVAE model. It is noticed from both tables that the original CVAE outperforms Seq2seq and Seq2seq+Attn. Apparently, the CVAE fits this problem well, as CVAE-type models are generative approaches which can decode high-dimensional output data using the learnt low-dimensional latent variable. Furthermore, with the knowledge distillation and the employed pre-trained model, the proposed CVAE-KD further enhances the model performance, which verifies the effectiveness of the proposed approach.

How the size of financial reports affect model performance (RQ2)

This experiment evaluates the performance of the proposed approach when generating financial reports of different lengths. Recall that we empirically set the length of the generated reports to 100, 150 and 200, respectively. We expect that the performance of CVAE-KD will gradually decrease when generating longer reports. The corresponding experimental results are reported in Table 3 and Table 4, respectively.

For simplicity, we denote by report(200) the generated report whose length is 200 words, and similarly we have report(100) and report(150). For the BLEU evaluation criterion, as shown in Table 3, it is noticed that the BLEU-1, BLEU-2, BLEU-3 and BLEU-4 scores of report(100) are 2.5%, 4.8%, 9.1% and 13.2% higher than those of report(150), which is the second best setting, and report(200) is the worst setting w.r.t. almost all BLEU criteria. In addition, we also notice that the BLEU-1, BLEU-2, BLEU-3 and BLEU-4 scores of each report length gradually decrease, which is consistent with our expectation that the model might not be accurate enough to generate a longer phrase.

Similarly, for the ROUGE results shown in Table 4, we find that the ROUGE-1, ROUGE-2 and ROUGE-L scores of report(100) are 4.4%, 12.9% and 9.9% higher than those of report(200), which is the worst setting, and the ROUGE scores also gradually decrease from ROUGE-1 to ROUGE-L. From these observations, we can conclude that it is quite a difficult task to generate long text from short text: the shorter the report length, the better the quality of the generated text. These objective evaluation results partially verify the effectiveness of the proposed knowledge distillation based approach. The following experiment evaluates the subjective performance of the proposed approach.

Report Length BLEU-1 BLEU-2 BLEU-3  BLEU-4
100 51.06 24.10 14.53  8.66
150 49.83 23.00 13.32  7.65
200 46.67 20.32 12.81  8.00
Table 3: The BLEU results of the proposed CVAE-KD on generating reports of different lengths.
Report Length ROUGE-1 ROUGE-2  ROUGE-L
100 19.08 2.98    7.64
150 18.59 2.91    7.55
200 18.27 2.64    6.95
Table 4: The ROUGE results of the proposed CVAE-KD on generating reports of different lengths.

A case study of generating financial report (RQ3)

To evaluate the quality of the generated reports, we choose a pair of news-report data and report the corresponding results of CVAE and CVAE-KD in Table 5. Note that due to the page limit, the length of the generated reports in this table is set to 200. We translated the generated Chinese reports into English via Google Translate and highlight the correct words in blue.

From the generated results, we have the following observations. First, the hit rate of the CVAE-KD report is higher than that of the CVAE report, which means that CVAE-KD is more accurate than CVAE. Second, the reports generated by CVAE-KD are more coherent with the target report. For example, “The growth rate of manufacturing investment has fallen” generated by CVAE-KD is more accurate than “the US inflationary pressure” generated by CVAE. These two observations verify that the quality of the reports generated by CVAE-KD is satisfactory. However, it is also noticed that there still exist repeated words or sentences in the generated CVAE-KD report, e.g., “We believe that we believe that we will” in the last sentence, which needs further research efforts.

Input News From the perspective of the U.S.’s own situation, the U.S. economic prospects are improving but the economic rebound will be less than expected. Private consumption in the U.S. depends on employment conditions and wage growth. When the number of employees is at a high level, the unemployment rate has dropped. At the same time, weaker U.S. inflation will further slow down wage growth.
Target Report Therefore, subject to wage growth and declining inflation, the recovery of private consumption in the United States will be less than expected. Secondly, in terms of private investment, US housing inventories are at a low level, and housing prices continue to rise, which will stimulate the rapid growth of new housing construction and real estate investment; but on the other hand, from the perspective of the inventory cycle, the U.S. manufacturing inventory replenishment is nearing completion. It is expected that the monthly inventory cycle will peak and fall, which will drag down manufacturing investment and therefore the U.S. fixed asset investment in the third quarter. Performance will be difficult to regain the strong first quarter.
CVAE-report ”We believe that,, of, month month day forecast; ())) month forecast; ((,,, and economic growth expectations and economic The growth rate fell back to—The economic growth rate fell back to the US dollar index, which was a month-on-month decline; the US dollar index was a month-on-month decline and the US dollar index was a quarter-on-quarter decline, and the year-on-year growth rate fell, but it will continue in the future. the US inflationary pressure; the market-to-market-to-market-to-China policy policy policy is also the government debt crisis risk appetite under downward pressure downward is also the government To a certain extent.
CVAE-KD-report In the economic data, the economic growth month since the meeting is expected. But the US economic data in the United States , the US government, investment and market expectations; while manufacturing investment growth rate down. The growth rate of manufacturing investment has fallen. The growth rate of manufacturing is in line with ”market expectations and” policies, etc. Under the policy: ”Under the policy: The central bank’s meeting on the market’s inflation in the middle of last month: This will be the index to increase inflation. Interest will be the main reason. The company will become a global enterprise and global enterprise field in the future. We believe that we believe that we will also have a global enterprise field in the future.
Table 5: The input news and the corresponding generated reports.

Results on ablation study (RQ4)

Note that the proposed model contains several components, of which the most important is the knowledge distillation (KD) component. To investigate whether the proposed KD component works, we perform this experiment by removing the KD component and revising the model loss function accordingly. The comparison results are recorded in Table 6.

From this table, it is noticed that the performance of the CVAE with the KD component is higher than that of the CVAE without the KD component, especially in terms of the BLEU scores. For the BLEU-1, BLEU-2, BLEU-3 and BLEU-4 criteria, the scores of CVAE-KD are 9.9%, 28.9%, 44.4% and 64.6% higher than those of the CVAE without KD. Similarly, the ROUGE-1, ROUGE-2 and ROUGE-L scores of CVAE-KD are 18.8%, 43.5% and 1.6% higher than those of the CVAE without KD, respectively. These results verify that the proposed CVAE-KD could capture external knowledge to help the generation of financial reports to some extent.

Methods BLEU-1 BLEU-2 BLEU-3 BLEU-4 ROUGE-1 ROUGE-2 ROUGE-L
CVAE-KD (without KD) 42.46 15.76 8.87 4.86 15.38 1.84     6.84
CVAE-KD 46.67 20.32 12.81 8.00 18.27 2.64     6.95
Table 6: Results of the ablation study.

Related Work

Existing text-to-text generation approaches could be classified into three categories, i.e., sequence-to-sequence based approaches, variational autoencoder based approaches and generative adversarial network based approaches, and we review these approaches in the following subsections.

Sequence-to-sequence based approaches

In the literature, sequence-to-sequence based approaches have achieved superior performance in various text-to-text generation tasks. Cho et al. (Cho et al. 2014) proposed a neural network architecture with RNNs as the encoding and decoding components: the encoder embeds the variable-length input sequence into a fixed-length feature vector, and the decoder maps the vector back to the variable-length target sequence. To generate long text, the Seq2Seq+Attn model (Bahdanau, Cho, and Bengio 2014) was proposed, which allows searching for relevant words from the source sentences. Feng et al. (Feng et al. 2018) proposed a multi-topic-aware long short-term memory (MTA-LSTM) network to generate paragraph-level Chinese essays. The CopyNet model (Gu et al. 2016) incorporates a copying mechanism into the learning process of the Seq2Seq model and achieves better performance. Similar attention-based approaches can be seen in (Dong et al. 2017). Recently, various transformer based approaches (Xu et al. 2020b; Koncel-Kedziorski et al. 2019; Keskar et al. 2019) and pre-trained models (Du et al. 2020; Song et al. 2019; Dong et al. 2019; Yang et al. 2019b) have been proposed and have achieved SOTA performance on related tasks. The financial report generation problem was first raised in (Hu, Zhang, and Yang 2019), where a two-stage hybrid deep learning model is proposed to generate macro research reports from a piece of breaking news.

Variational autoencoder based approaches

Variational autoencoder (VAE) models are also widely used in text generation tasks. For instance, (Bowman et al. 2015) proposes an RNN-based VAE model which learns the feature representations of latent variables at the sentence level; the proposed model can explicitly represent holistic properties of sentences such as style, topic, and high-level syntactic features. An inference network (Miao, Yu, and Blunsom 2016) is proposed to be applied to the discrete input to estimate the variational distribution. The authors of (Miao and Blunsom 2016) further model the input text as a discrete latent variable under the variational auto-encoding framework. Then, a neural network-based generative architecture with latent stochastic variables (Serban et al. 2016) is proposed to generate diverse text. To generate a long sequence of text, a hybrid architecture (Semeniuta, Severyn, and Barth 2017) is proposed to interweave feed-forward convolutional and deconvolutional components. The conditional VAE model (Wang and Wan 2019) is considered the state-of-the-art approach in this task; it employs a shared attention layer for both encoder and decoder, which is able to learn better feature representations of coherent sentences. Later, a multi-pass hierarchical CVAE (Yu et al. 2020) is proposed for automatic storytelling. Note that CVAE-type models are generally optimized through the ELBO (McCarthy et al. 2020).

Generative adversarial network based approaches

The original generative adversarial network (GAN) (Goodfellow et al. 2014) has been widely adapted to various research problems. As the original GAN cannot model discrete variables, (Kusner and Hernández-Lobato 2016) proposes to employ the Gumbel-softmax distribution for this issue. To generate text, an LSTM module (Zhang, Gan, and Carin 2016) or a GRU module (Zhu et al. 2018) is commonly adopted as the generator component. By modeling the generator as a stochastic policy (Yu et al. 2017), SeqGAN is proposed and updated using a policy gradient rule. RankGAN (Lin et al. 2017) is then proposed to generate high-quality textual descriptions by revising the discriminator into a ranker. LeakGAN (Guo et al. 2017) is further proposed to generate long text within 40 words. To reduce labeling cost, (Croce, Castellucci, and Basili 2020) proposes GAN-BERT, which extends the BERT-like architecture for modeling unlabeled data in a generative adversarial setting.

Conclusion

In this paper, we propose a conditional variational autoencoder based approach with knowledge distillation (CVAE-KD) to automatically generate long financial reports from a piece of short news. Particularly, a higher-level latent variable is learnt from the background knowledge base extracted for each input, and the latent variable of the CVAE is forced to approximate this higher-level latent variable. At last, a knowledge distillation component is designed which takes the output of the pre-trained model as a teacher to better supervise the generation of financial reports. Extensive experiments are performed on a public dataset to evaluate the model performance. The experimental results demonstrate that the proposed approach achieves state-of-the-art performance against the compared approaches.

References

  • Bahdanau, Cho, and Bengio (2014) Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv: Computation and Language .
  • Bowman et al. (2015) Bowman, S. R.; Vilnis, L.; Vinyals, O.; Dai, A. M.; and Bengio, S. 2015. Generating Sentences from a Continuous Space. Computer Science.
  • Cao et al. (2018) Cao, Z.; Li, W.; Li, S.; and Wei, F. 2018. Retrieve, Rerank and Rewrite: Soft Template Based Neural Summarization. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 152–161.
  • Cho et al. (2014) Cho, K.; Van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv: Computation and Language .
  • Croce, Castellucci, and Basili (2020) Croce, D.; Castellucci, G.; and Basili, R. 2020. GAN-BERT: Generative Adversarial Learning for Robust Text Classification with a Bunch of Labeled Examples. In ACL 2020: 58th annual meeting of the Association for Computational Linguistics, 2114–2119.
  • Dai et al. (2019) Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J. G.; Le, Q.; and Salakhutdinov, R. 2019. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2978–2988.
  • Dong et al. (2017) Dong, L.; Huang, S.; Wei, F.; Lapata, M.; Zhou, M.; and Xu, K. 2017. Learning to Generate Product Reviews from Attributes 1: 623–632.
  • Dong et al. (2019) Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13063–13075.
  • Du et al. (2020) Du, C.; Sun, H.; Wang, J.; Qi, Q.; and Liao, J. 2020. Adversarial and Domain-Aware BERT for Cross-Domain Sentiment Analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4019–4028.
  • Duan et al. (2019) Duan, Y.; Pei, J.; Xu, C.; and Li, C. 2019. Pre-train and Plug-in: Flexible Conditional Text Generation with Variational Auto-Encoders. arXiv: Computation and Language .
  • Feng et al. (2018) Feng, X.; Liu, M.; Liu, J.; Qin, B.; Sun, Y.; and Liu, T. 2018. Topic-to-Essay Generation with Neural Networks. In IJCAI, 4078–4084.
  • Gao, Zhao, and Eger (2020) Gao, Y.; Zhao, W.; and Eger, S. 2020. SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization. In ACL 2020: 58th annual meeting of the Association for Computational Linguistics, 1347–1354.
  • Genest and Lapalme (2011) Genest, P.-E.; and Lapalme, G. 2011. Framework for abstractive summarization using text-to-text generation. In Proceedings of the workshop on monolingual text-to-text generation, 64–73.
  • Goodfellow et al. (2014) Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in neural information processing systems, 2672–2680.
  • Gu et al. (2016) Gu, J.; Lu, Z.; Li, H.; and Li, V. O. K. 2016. Incorporating copying mechanism in sequence-to-sequence learning 1: 1631–1640.
  • Guo et al. (2017) Guo, J.; Lu, S.; Cai, H.; Zhang, W.; Yu, Y.; and Wang, J. 2017. Long text generation via adversarial training with leaked information. arXiv preprint arXiv:1709.08624 .
  • Hossain, Ghazvininejad, and Zettlemoyer (2020) Hossain, N.; Ghazvininejad, M.; and Zettlemoyer, L. 2020. Simple and Effective Retrieve-Edit-Rerank Text Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2532–2538.
  • Hu, Zhang, and Yang (2019) Hu, W.; Zhang, X.; and Yang, G. 2019. Automatically Generating Macro Research Reports from a Piece of News. arXiv preprint arXiv:1911.09572 .
  • Keskar et al. (2019) Keskar, N. S.; McCann, B.; Varshney, L. R.; Xiong, C.; and Socher, R. 2019. Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858 .
  • Kingma and Ba (2014) Kingma, D.; and Ba, J. 2014. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations .
  • Kingma and Welling (2013) Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 .
  • Koncel-Kedziorski et al. (2019) Koncel-Kedziorski, R.; Bekal, D.; Luan, Y.; Lapata, M.; and Hajishirzi, H. 2019. Text Generation from Knowledge Graphs with Graph Transformers. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2284–2293.
  • Kusner and Hernández-Lobato (2016) Kusner, M. J.; and Hernández-Lobato, J. M. 2016. Gans for sequences of discrete elements with the gumbel-softmax distribution. arXiv preprint arXiv:1611.04051 .
  • Liao et al. (2018) Liao, Y.; Bing, L.; Li, P.; Shi, S.; and Lam, W. 2018. QuaSE: Sequence Editing under Quantifiable Guidance .
  • Lin et al. (2017) Lin, K.; Li, D.; He, X.; Zhang, Z.; and Sun, M.-T. 2017. Adversarial ranking for language generation. In Advances in Neural Information Processing Systems, 3155–3165.
  • Liu and Lapata (2019) Liu, Y.; and Lapata, M. 2019. Text Summarization with Pretrained Encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3730–3740.
  • Mager et al. (2020) Mager, M.; Astudillo, R. F.; Naseem, T.; Sultan, A.; Lee, Y.-S.; Florian, R.; and Roukos, S. 2020. GPT-too: A language-model-first approach for AMR-to-text generation. In ACL 2020: 58th annual meeting of the Association for Computational Linguistics, 1846–1852.
  • McCarthy et al. (2020) McCarthy, A. D.; Li, X.; Gu, J.; and Dong, N. 2020. Addressing Posterior Collapse with Mutual Information for Improved Variational Neural Machine Translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 8512–8525.
  • Miao and Blunsom (2016) Miao, Y.; and Blunsom, P. 2016. Language as a latent variable: Discrete generative models for sentence compression. arXiv preprint arXiv:1609.07317 .
  • Miao, Yu, and Blunsom (2016) Miao, Y.; Yu, L.; and Blunsom, P. 2016. Neural variational inference for text processing. In International conference on machine learning, 1727–1736.
  • Peters et al. (2018) Peters, M.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2227–2237.
  • See, Liu, and Manning (2017) See, A.; Liu, P. J.; and Manning, C. D. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368 .
  • Semeniuta, Severyn, and Barth (2017) Semeniuta, S.; Severyn, A.; and Barth, E. 2017. A Hybrid Convolutional Variational Autoencoder for Text Generation 627–637.
  • Serban et al. (2017) Serban, I. V.; Klinger, T.; Tesauro, G.; Talamadupula, K.; Zhou, B.; Bengio, Y.; and Courville, A. 2017. Multiresolution recurrent neural networks: An application to dialogue response generation. In Thirty-First AAAI Conference on Artificial Intelligence.
  • Serban et al. (2016) Serban, I. V.; Sordoni, A.; Lowe, R.; Charlin, L.; Pineau, J.; Courville, A.; and Bengio, Y. 2016. A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069 .
  • Shen et al. (2019) Shen, D.; Celikyilmaz, A.; Zhang, Y.; Chen, L.; Wang, X.; Gao, J.; and Carin, L. 2019. Towards generating long and coherent text with multi-level latent variable models. arXiv preprint arXiv:1902.00154 .
  • Song et al. (2019) Song, K.; Tan, X.; Qin, T.; Lu, J.; and Liu, T. 2019. MASS: Masked Sequence to Sequence Pre-training for Language Generation. In Proceedings of the Thirty-sixth International Conference on Machine Learning.
  • Sorodoc, Gulordava, and Boleda (2020) Sorodoc, I.; Gulordava, K.; and Boleda, G. 2020. Probing for Referential Information in Language Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4177–4189.
  • Sutskever, Vinyals, and Le (2014) Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to Sequence Learning with Neural Networks. Advances in neural information processing systems .
  • Wang and Wan (2019) Wang, T.; and Wan, X. 2019. T-CVAE: Transformer-Based Conditioned Variational Autoencoder for Story Completion. In Twenty-Eighth International Joint Conference on Artificial Intelligence IJCAI-19.
  • Wang et al. (2019) Wang, W.; Gan, Z.; Xu, H.; Zhang, R.; Wang, G.; Shen, D.; Chen, C.; and Carin, L. 2019. Topic-Guided Variational Auto-Encoder for Text Generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 166–177.
  • Xu et al. (2020a) Xu, S.; Li, H.; Yuan, P.; Wu, Y.; He, X.; and Zhou, B. 2020a. Self-Attention Guided Copy Mechanism for Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 1355–1362.
  • Xu et al. (2020b) Xu, S.; Li, H.; Yuan, P.; Wu, Y.; He, X.; and Zhou, B. 2020b. Self-Attention Guided Copy Mechanism for Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 1355–1362. Online: Association for Computational Linguistics.
  • Yang et al. (2019a) Yang, P.; Li, L.; Luo, F.; Liu, T.; and Sun, X. 2019a. Enhancing Topic-to-Essay Generation with External Commonsense Knowledge. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
  • Yang et al. (2018) Yang, X.; Lin, X.; Suo, S.; and Li, M. 2018. Generating Thematic Chinese Poetry using Conditional Variational Autoencoders with Hybrid Decoders .
  • Yang et al. (2019b) Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R. R.; and Le, Q. V. 2019b. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, 5753–5763.
  • Yu et al. (2017) Yu, L.; Zhang, W.; Wang, J.; and Yu, Y. 2017. Seqgan: Sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence.
  • Yu et al. (2020) Yu, M. H.; Li, J.; Liu, D.; Tang, B.; Zhang, H.; Zhao, D.; and Yan, R. 2020. Draft and Edit: Automatic Storytelling Through Multi-Pass Hierarchical Conditional Variational Autoencoder. AAAI 2020 : The Thirty-Fourth AAAI Conference on Artificial Intelligence 34(2): 1741–1748.
  • Zhang et al. (2020) Zhang, R.; Chen, C.; Gan, Z.; Wang, W.; Shen, D.; Wang, G.; Wen, Z.; and Carin, L. 2020. Improving Adversarial Text Generation by Modeling the Distant Future. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2516–2531.
  • Zhang, Gan, and Carin (2016) Zhang, Y.; Gan, Z.; and Carin, L. 2016. Generating text via adversarial training. In NIPS workshop on Adversarial Training, volume 21.
  • Zhao, Zhao, and Eskenazi (2017) Zhao, T.; Zhao, R.; and Eskenazi, M. 2017. Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, 654–664.
  • Zhu et al. (2018) Zhu, Y.; Lu, S.; Zheng, L.; Guo, J.; Zhang, W.; Wang, J.; and Yu, Y. 2018. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 1097–1100.