Importance Estimation from Multiple Perspectives for
Keyphrase Extraction

Mingyang Song, Liping Jing  and Lin Xiao
Beijing Key Lab of Traffic Data Analysis and Mining
Beijing Jiaotong University, China
{mingyang.song, lpjing, 17112079}@bjtu.edu.cn
  Corresponding author.
Abstract

Keyphrase extraction is a fundamental task in Natural Language Processing, which usually contains two main parts: candidate keyphrase extraction and keyphrase importance estimation. When humans read a document, they typically judge the importance of a phrase by its syntactic accuracy, information saliency, and concept consistency simultaneously. However, most existing keyphrase extraction approaches focus on only some of these perspectives, which leads to biased results. In this paper, we propose a new approach that estimates the importance of keyphrases from multiple perspectives (called KIEMP) and further improves the performance of keyphrase extraction. Specifically, KIEMP estimates the importance of a phrase with three modules: a chunking module to measure its syntactic accuracy, a ranking module to check its information saliency, and a matching module to judge the concept (i.e., topic) consistency between the phrase and the whole document. These three modules are joined via an end-to-end multi-task learning model, which helps the three parts enhance each other and balances the effects of the three perspectives. Experimental results on six benchmark datasets show that KIEMP outperforms the existing state-of-the-art keyphrase extraction approaches in most cases.

1 Introduction

Keyphrase Extraction (KE) aims to select a set of reliable phrases (e.g., “harmonic balance method”, “grobner base”, “error bound”, “algebraic representation”, and “singular point” in Table 1) with salient information and central topics from a given document, and is a fundamental task in natural language processing. Most classic keyphrase extraction methods include two main components: candidate keyphrase extraction and keyphrase importance estimation Medelyan et al. (2009); Liu et al. (2010); Hasan and Ng (2014).

Input Document: harmonic balance ( hb ) method is well known principle for analyzing periodic oscillations on nonlinear networks and systems. because the hb method has a truncation error, approximated solutions have been guaranteed by error bounds. however, its numerical computation is very time consuming compared with solving the hb equation. this paper proposes proposes an algebraic representation of the error bound using grobner base. the algebraic representation enables to decrease the computational cost of the error bound considerably. moreover, using singular points of the algebraic representation, we can obtain accurate break points of the error bound by collisions.
Output / Target Keyphrases: harmonic balance method; grobner base; error bound; algebraic representation; singular point; quadratic approximation
Table 1: Sample input document with output / target keyphrases from the KP20k testing set. Keyphrases can typically be categorized into two types: present keyphrases, which appear in the given document, and absent keyphrases, which do not.

As shown in Table 1, each keyphrase usually consists of more than one word Meng et al. (2017). To extract the candidate keyphrases from the given document, which is typically characterized via word-level representations, researchers leverage heuristics Wan and Xiao (2008); Liu et al. (2009a, b); Nguyen and Phan (2009); Grineva et al. (2009); Medelyan et al. (2009) to identify the candidate keyphrases. For example, word embeddings are composed into n-grams by a Convolutional Neural Network (CNN) Xiong et al. (2019); Sun et al. (2020); Wang et al. (2020).

Usually, the candidate set contains many more phrases than the ground-truth keyphrase set. Therefore, it is critical to select the important keyphrases from the candidate set with a good strategy. In other words, keyphrase importance estimation is one of the essential components of many keyphrase extraction models. Since keyphrase extraction concerns “the automatic selection of important and topical phrases from the body of a document” Turney (2000), its goal is to estimate the importance of the candidate keyphrases to determine which ones should be extracted. Recent approaches Sun et al. (2020); Wang et al. (2020) recast keyphrase extraction as a classification problem, which extracts keyphrases with a binary classifier. However, a binary classifier classifies each candidate keyphrase independently, and consequently it does not allow us to determine which candidates are better than the others Hulth (2004). Therefore, some methods Jiang et al. (2009); Xiong et al. (2019); Wang et al. (2020); Sun et al. (2020) propose a ranking model to extract keyphrases, where the goal is to learn a phrase ranker that compares the saliency of two candidate phrases. Furthermore, many previous studies Liu et al. (2010); Wang et al. (2019); Liu et al. (2009b) extract keyphrases using the main topics discussed in the source document. For example, Liu et al. (2010) propose a topical PageRank approach to measure the importance of words with respect to different topics.

However, most existing keyphrase extraction methods estimate the importance of keyphrases from at most two perspectives, leading to biased extraction. Therefore, to improve the performance of keyphrase extraction, the importance of the candidate keyphrases needs to be estimated from multiple perspectives. Motivated by this observation, we propose a new approach that estimates importance from multiple perspectives simultaneously for the keyphrase extraction task. Concretely, it estimates the importance from three perspectives (syntactic accuracy, information saliency, and concept consistency) with three modules. A chunking module, as a binary classification layer, measures the syntactic accuracy of each candidate keyphrase. A ranking module checks the information saliency of each candidate phrase with a pairwise ranking approach, which introduces competition between the candidate keyphrases to extract more salient keyphrases. A matching module judges the concept relevance of each candidate phrase to the document via a metric learning framework. Furthermore, our model is trained jointly on the above three modules, balancing the effects of the three perspectives. Experimental results on six benchmark datasets show that KIEMP outperforms the existing state-of-the-art keyphrase extraction approaches in most cases.

2 Related Work

A good keyphrase extraction system typically consists of two steps: (1) candidate keyphrase extraction, extracting a list of words / phrases that serve as the candidate keyphrases using some heuristics Wan and Xiao (2008); Nguyen and Phan (2009); Medelyan et al. (2009); Grineva et al. (2009); Liu et al. (2009a, b); and (2) keyphrase importance estimation, determining which of these candidate phrases are keyphrases using different importance estimation approaches.

In candidate keyphrase extraction, the heuristic rules are usually designed to avoid spurious phrases and keep the number of candidates to a minimum Hasan and Ng (2014). Generally, the heuristics mainly include (1) leveraging a stop word list Liu et al. (2009b), (2) allowing only words with certain part-of-speech tags Mihalcea and Tarau (2004); Liu et al. (2009a), and (3) composing words into n-grams to form the candidate keyphrases Medelyan et al. (2009); Sun et al. (2020); Xiong et al. (2019); Wang et al. (2020). The above heuristics have proven effective, with high recall in extracting gold keyphrases from various sources. Motivated by the above methods, in this paper, we leverage CNNs to compose words into n-grams as the candidate keyphrases.

In keyphrase importance estimation, the existing methods can be mainly divided into two categories: unsupervised and supervised. The unsupervised methods are usually categorized into four groups, i.e., graph-based ranking Mihalcea and Tarau (2004), topic-based clustering Liu et al. (2009b), simultaneous learning Zha (2002), and language modeling Tomokiyo and Hurst (2003). Early supervised approaches to keyphrase extraction recast this task as a binary classification problem Witten et al. (1999); Turney (2002, 2000); Jiang et al. (2009). Later, to determine which candidates are better than the others, many ranking approaches were proposed to rank the saliency of two phrases Jiang et al. (2009); Sun et al. (2020). This pairwise ranking approach introduces competition between candidate keyphrases and has achieved good performance. Both supervised and unsupervised methods construct features or models from different perspectives to measure the importance of candidate keyphrases and determine which keyphrases should be extracted. However, the approaches mentioned above consider at most two perspectives when measuring the importance of phrases, which leads to biased keyphrase extraction. Different from the existing methods, the proposed KIEMP estimates the importance of the candidate keyphrases from multiple perspectives simultaneously.

3 Methodology

We formally define the problem of keyphrase extraction as follows. KIEMP takes a document $D=\{w_{1},...,w_{i},...,w_{M}\}$ and learns to extract a set of keyphrases $K$ (each keyphrase may be composed of one or several words) from their n-gram based representations under multiple perspectives.

This section describes the architecture of KIEMP, as shown in Figure 1. KIEMP mainly consists of two submodels: candidate keyphrase extraction and keyphrase importance estimation. The former first identifies and extracts the candidate keyphrases. Then the latter estimates the importance of keyphrases from three perspectives simultaneously with three modules to determine which one should be extracted.

Figure 1: The KIEMP model architecture.

3.1 Contextualized Word Representation

Recently, pre-trained language models Peters et al. (2018); Devlin et al. (2019); Liu et al. (2019) have emerged as a critical technology for achieving impressive gains in a wide variety of natural language tasks Liu and Lapata (2019). These models extend the idea of word embeddings by learning contextual representations from large-scale corpora using a language modeling objective. In this setting, Xiong et al. (2019) propose to represent each word by its ELMo Peters et al. (2018) embedding, and Sun et al. (2020) leverage variants of BERT Devlin et al. (2019); Liu et al. (2019) to obtain contextualized word representations. Motivated by the above approaches, we represent each word by RoBERTa Liu et al. (2019), which encodes $D$ into a sequence of vectors $H=\{h_{1},...,h_{i},...,h_{M}\}$:

$$H=\text{RoBERTa}\{w_{1},...,w_{i},...,w_{M}\}, \tag{1}$$

where $h_{i}\in\mathbb{R}^{d}$ indicates the $i$-th contextualized word embedding of $w_{i}$ from the last transformer layer in RoBERTa. Specifically, the [CLS] token of RoBERTa is used as the document representation.

3.2 Candidate Keyphrase Extraction

In the keyphrase extraction task, a keyphrase usually contains more than one word, as shown in Table 1. Therefore, it is necessary to identify the candidate keyphrases via some strategy. Previous work Medelyan et al. (2009); Sun et al. (2020); Wang et al. (2020); Xiong et al. (2019) allows n-grams that appear in the document to be the candidate keyphrases. Motivated by the previous approaches, we consider the language properties Xiong et al. (2019) and compose the contextualized word representations into n-grams by CNNs (similar to Sun et al. (2020)). Specifically, the phrase representation of the $i$-th $n$-gram $c_{i}^{n}$ is computed as:

$$h_{i}^{n}=\text{CNN}^{n}(h_{i:i+n}), \tag{2}$$

where $h_{i}^{n}\in\mathbb{R}^{d}$ indicates the $i$-th $n$-gram representation. Concretely, $n\in[1,N]$ is the length of the n-grams, and $N$ indicates the maximum length of allowed candidate n-grams. Each n-gram length has its own set of convolution filters $\text{CNN}^{n}$ with window size $n$ and stride $1$.
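As a rough illustration of Eq. (2), the sketch below slides a window of size $n$ with stride 1 over a list of word vectors and applies a separate filter bank for each $n$. It is a minimal pure-Python sketch under stated assumptions, not the implementation used in our experiments: the function name `compose_ngrams`, the flat filter layout, and the toy dimensions are all hypothetical, and a real model would apply a learned 1-D convolution to the RoBERTa outputs.

```python
import random

def compose_ngrams(word_vecs, filters, max_n):
    """Return {n: [h_i^n, ...]} for n = 1..max_n (window size n, stride 1).

    word_vecs: list of M word vectors, each of length d.
    filters[n]: d_out filters, each of length n * d (assumed layout).
    """
    out = {}
    for n in range(1, max_n + 1):
        reps = []
        for i in range(len(word_vecs) - n + 1):
            # Concatenate the n word vectors covered by the window starting at i.
            window = [x for vec in word_vecs[i:i + n] for x in vec]
            # One output unit per filter: dot product of the filter with the window.
            reps.append([sum(w * x for w, x in zip(f, window)) for f in filters[n]])
        out[n] = reps
    return out

# Toy example: M = 4 words, d = 3, N = 2 (all values are hypothetical).
random.seed(0)
d, max_n = 3, 2
words = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(4)]
filters = {n: [[random.uniform(-1, 1) for _ in range(n * d)] for _ in range(d)]
           for n in range(1, max_n + 1)}
ngrams = compose_ngrams(words, filters, max_n)
print({n: len(v) for n, v in ngrams.items()})  # 4 unigrams and 3 bigrams
```

A document of $M$ words yields $M-n+1$ candidates of length $n$, so the candidate set grows roughly linearly in $N$.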

3.3 Keyphrase Importance Estimation

In the keyphrase extraction models, keyphrase importance estimation commonly is one of the essential components. To improve the accuracy of keyphrase extraction, we estimate the importance of keyphrases from three perspectives simultaneously with three modules: chunking for syntactic accuracy, ranking for information saliency, and matching for concept consistency.

3.3.1 Chunking for Syntactic Accuracy

Many studies Turney (2002); Witten et al. (1999); Turney (2000) regard keyphrase extraction as a classification task, in which a model is trained to determine whether a candidate phrase is a keyphrase from a syntactic perspective. For example, Xiong et al. (2019); Sun et al. (2020) directly predict whether an n-gram is a keyphrase based on its corresponding representation. Motivated by these methods, in this paper, the syntactic accuracy of phrase $c_{i}^{n}$ is estimated by a chunking module:

$$I_{1}(c_{i}^{n})=\text{softmax}(\mathbf{W}_{1}h_{i}^{n}+b_{1}), \tag{3}$$

where $\mathbf{W}_{1}$ and $b_{1}$ indicate a trainable matrix and a bias. The softmax is taken over all possible n-grams at each position $i$ and each length $n$. The whole model is trained using the cross-entropy loss:

$$L_{c}=\text{CrossEntropy}(y_{i}^{n},I_{1}(c_{i}^{n})), \tag{4}$$

where $y_{i}^{n}$ is the label indicating whether the phrase $c_{i}^{n}$ is a keyphrase of the original document.
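To make Eqs. (3) and (4) concrete, the following minimal sketch scores a single candidate with a two-class softmax and computes its cross-entropy loss. Treating the chunking output as two classes per candidate (non-keyphrase, keyphrase) is an assumption made for this illustration, and the helper names and toy weights are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def chunk_probs(h, W1, b1):
    """I_1(c) = softmax(W1 h + b1) with two output classes (assumed)."""
    logits = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(W1, b1)]
    return softmax(logits)

def cross_entropy(label, probs):
    """Negative log-probability of the gold class (label is 0 or 1)."""
    return -math.log(probs[label])

# Hypothetical 3-dimensional phrase representation and weights.
h = [0.5, -1.0, 2.0]
W1 = [[0.1, 0.2, -0.3],   # row for class 0 (non-keyphrase)
      [-0.2, 0.1, 0.4]]   # row for class 1 (keyphrase)
b1 = [0.0, 0.0]
probs = chunk_probs(h, W1, b1)
loss = cross_entropy(1, probs)  # loss if the candidate is a gold keyphrase
```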

3.3.2 Ranking for Information Saliency

The binary classifier-based keyphrase extraction model classifies each candidate keyphrase independently, and consequently it does not allow us to determine which candidates are better than the others Hulth (2004). However, the goal of keyphrase extraction is to identify the most salient phrases for a document Hasan and Ng (2014). Therefore, a ranking model is required to rank the saliency of the candidate keyphrases. We leverage a pairwise learning approach to rank the candidate keyphrases globally, comparing the information saliency between all candidates. First, to obtain the ranking labels, we put the candidate keyphrases in the document that are labeled as keyphrases in the positive set $\mathbf{P}^{+}$, and the others in the negative set $\mathbf{P}^{-}$. Then, the loss function is the standard hinge loss of the pairwise learning model:

$$L_{r}=\sum_{p^{+},p^{-}\in K}\max(0,\delta_{1}-I_{2}(p^{+})+I_{2}(p^{-})), \tag{5}$$

where $I_{2}(\cdot)$ represents the estimation of information saliency and $\delta_{1}$ indicates the margin. It enforces KIEMP to rank the candidate keyphrases $p^{+}$ ahead of $p^{-}$ within the same document. Specifically, the information saliency of the $i$-th n-gram $c_{i}^{n}$ is computed from its representation as follows:

$$I_{2}(c_{i}^{n})=\mathbf{W}_{2}h_{i}^{n}+b_{2}, \tag{6}$$

where $\mathbf{W}_{2}$ is a trainable matrix, and $b_{2}$ is a bias. Through the pairwise learning model, we can rank the information saliency of all candidates and extract the keyphrases with the most salient information.
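The pairwise objective in Eqs. (5) and (6) can be sketched in a few lines. This is an illustrative pure-Python version with hypothetical scores; the default margin of 1.0 follows the value of $\delta_{1}$ listed in Table 3.

```python
def saliency(h, w2, b2):
    """I_2(c) = w2 . h + b2, a scalar saliency score (Eq. 6)."""
    return sum(w * x for w, x in zip(w2, h)) + b2

def pairwise_hinge_loss(pos_scores, neg_scores, margin=1.0):
    """Sum over all (p+, p-) pairs of max(0, margin - I2(p+) + I2(p-)) (Eq. 5)."""
    return sum(max(0.0, margin - sp + sn)
               for sp in pos_scores for sn in neg_scores)

# Hypothetical saliency scores for two keyphrases and two non-keyphrases.
pos = [2.5, 0.4]
neg = [0.1, 1.0]
loss = pairwise_hinge_loss(pos, neg)
# Only the pairs involving the low-scoring positive (0.4) are penalized:
# (0.4, 0.1) contributes 0.7 and (0.4, 1.0) contributes 1.6, giving 2.3 in total.
```

A pair contributes nothing once the positive candidate outscores the negative one by at least the margin, so training effort concentrates on the pairs that are still mis-ordered or too close.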

3.3.3 Matching for Concept Consistency

As phrases are used to express various meanings corresponding to different concepts (i.e., topics), a phrase plays roles of different importance in different concepts of the document Liu et al. (2010). A matching module is proposed via a metric learning framework to estimate the concept consistency between the candidate keyphrases and their corresponding document. We first apply a variational autoencoder Rezende et al. (2014) to the documents $\mathbf{D}$ and the candidate keyphrases $\mathbf{K}$ to obtain their concepts. Each document $D$ is encoded via a latent variable $z\in\mathbb{R}^{c}$, which is assumed to be sampled from a standard Gaussian prior, i.e., $z\sim p(z)=\mathcal{N}(0,I_{d})$. Such a variable is able to capture the latent concepts hidden in the documents and is useful for extracting keyphrases Wang et al. (2019). During the encoding process, $z$ can be sampled via the re-parameterization trick for the Gaussian distribution, i.e., $z\sim q(z|D)=\mathcal{N}(\mu,\sigma)$. Specifically, we sample an auxiliary noise variable $\varepsilon\sim\mathcal{N}(0,I)$ and re-parameterize $z=\mu+\sigma\odot\varepsilon$, where $\odot$ denotes element-wise multiplication. The mean vector $\mu\in\mathbb{R}^{c}$ and variance vector $\sigma\in\mathbb{R}^{c}$ are inferred by a two-layer network with ReLU activation, i.e., $\mu=\mu_{\phi}(D)$ and $\sigma=\sigma_{\phi}(D)$, where $\phi$ is the parameter set. During the decoding process, the document is reconstructed by a multi-layer network $f_{k}$ with Tanh activation, i.e., $\tilde{D}=f_{k}(z)$. The candidate keyphrases are processed in the same way as the documents.
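The re-parameterization step described above can be sketched as follows. This is a minimal illustration with hypothetical values for $\mu$ and $\sigma$; in the model these would be produced by the encoder network, which is omitted here.

```python
import random

def reparameterize(mu, sigma, rng):
    """z = mu + sigma * eps with eps ~ N(0, I), applied element-wise."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

rng = random.Random(0)       # fixed seed so the sketch is reproducible
mu = [0.0, 1.0, -2.0]
sigma = [1.0, 0.5, 0.0]      # a zero entry makes that coordinate deterministic
z = reparameterize(mu, sigma, rng)
```

Because the randomness enters only through $\varepsilon$, the sample $z$ is a differentiable function of $\mu$ and $\sigma$, which is what allows the encoder to be trained by backpropagation.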

Given the latent concept representation of the document $z$ and of the phrase $z_{i}^{n}$, the concept consistency can be estimated as follows:

$$I_{3}(c_{i}^{n},D)=z_{i}^{n}\mathbf{W}_{3}z. \tag{7}$$

Here, $\mathbf{W}_{3}$ is a learnable mapping matrix. The loss function is the triplet loss of the metric learning framework, calculated as follows:

$$L_{m}=\sum_{p^{+},p^{-}\in K}\max(0,I_{3}(p^{-},D)-I_{3}(p^{+},D)+\delta_{2}), \tag{8}$$

where $\delta_{2}$ represents the margin. It enforces KIEMP to match and rank the concept consistency of keyphrases $p^{+}$ ahead of the non-keyphrases $p^{-}$ within their corresponding document $D$.
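As an illustration of Eqs. (7) and (8), the sketch below computes the bilinear concept score and the triplet loss for toy latent vectors. The vectors and the identity mapping matrix are hypothetical values chosen so that the arithmetic is easy to follow; the default margin of 1.0 follows the value of $\delta_{2}$ in Table 3.

```python
def concept_score(z_phrase, W3, z_doc):
    """I_3(c, D) = z_phrase^T W3 z_doc (Eq. 7)."""
    Wz = [sum(w * x for w, x in zip(row, z_doc)) for row in W3]
    return sum(a * b for a, b in zip(z_phrase, Wz))

def triplet_loss(pos_scores, neg_scores, margin=1.0):
    """Sum over pairs of max(0, I3(p-, D) - I3(p+, D) + margin) (Eq. 8)."""
    return sum(max(0.0, sn - sp + margin)
               for sp in pos_scores for sn in neg_scores)

W3 = [[1.0, 0.0], [0.0, 1.0]]   # identity: the score reduces to a dot product
z_doc = [1.0, 2.0]
pos = [concept_score([1.0, 1.0], W3, z_doc)]   # 1*1 + 1*2 = 3.0
neg = [concept_score([0.5, 1.0], W3, z_doc)]   # 0.5*1 + 1*2 = 2.5
loss = triplet_loss(pos, neg)                  # max(0, 2.5 - 3.0 + 1.0) = 0.5
```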

Furthermore, to simultaneously minimize the reconstruction loss and penalize the discrepancy between the prior distribution and the posterior distribution of the latent variable $z$, the VAE process can be implemented by optimizing the following objective functions for the documents ($L_{d}$) and the candidate keyphrases ($L_{k}$):

$$L_{d}=-\mathbb{E}_{q(\mathbf{z}|\mathbf{D})}\big[p(\mathbf{D}|\mathbf{z})\big]+D_{KL}\big(p(\mathbf{z})||q(\mathbf{z}|\mathbf{D})\big), \tag{9}$$

$$L_{k}=-\mathbb{E}_{q(\mathbf{z}|\mathbf{K})}\big[p(\mathbf{K}|\mathbf{z})\big]+D_{KL}\big(p(\mathbf{z})||q(\mathbf{z}|\mathbf{K})\big), \tag{10}$$

where $D_{KL}$ indicates the Kullback-Leibler divergence between two distributions. The final loss of this module is calculated as follows:

$$L_{t}=L_{m}+\lambda L_{d}+(1-\lambda)L_{k}, \tag{11}$$

where $\lambda\in(0,1)$ indicates the balance factor. Through concept consistency matching, we expect to align keyphrases with high-level concepts (i.e., topics or structures) in the document to assist the model in extracting keyphrases with more important concepts.

3.4 Model Training and Inference

Multi-task learning has played an essential role in various fields Srna et al. (2018), and has recently been widely used in natural language processing tasks Sun et al. (2020); Mu et al. (2020). Therefore, our framework allows end-to-end learning of syntactic chunking, saliency ranking, and concept matching. We define the training objective of the entire framework as the linear combination of $L_{c}$, $L_{r}$, and $L_{t}$:

$$L=\epsilon_{1}L_{c}+\epsilon_{2}L_{r}+\epsilon_{3}L_{t}, \tag{12}$$

where the hyper-parameters $\epsilon_{1}$, $\epsilon_{2}$, and $\epsilon_{3}$ balance the effects of the importance estimation from the three perspectives. Specifically, $\epsilon_{1}+\epsilon_{2}+\epsilon_{3}=1$.
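The way the individual losses are combined in Eqs. (11) and (12) can be written out directly. The sketch below uses the weights from Table 3 as defaults; the function names and the input loss values are hypothetical and serve only to show the arithmetic.

```python
def module_loss(l_m, l_d, l_k, lam=0.5):
    """L_t = L_m + lam * L_d + (1 - lam) * L_k (Eq. 11)."""
    return l_m + lam * l_d + (1.0 - lam) * l_k

def total_loss(l_c, l_r, l_t, eps=(1 / 3, 1 / 3, 1 / 3)):
    """L = eps1 * L_c + eps2 * L_r + eps3 * L_t, with the weights summing to 1 (Eq. 12)."""
    assert abs(sum(eps) - 1.0) < 1e-9, "the weights are expected to sum to 1"
    return eps[0] * l_c + eps[1] * l_r + eps[2] * l_t

# Hypothetical per-module loss values.
l_t = module_loss(l_m=0.5, l_d=2.0, l_k=4.0)   # 0.5 + 1.0 + 2.0 = 3.5
loss = total_loss(l_c=0.7, l_r=2.3, l_t=l_t)   # (0.7 + 2.3 + 3.5) / 3
```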

KIEMP aims to extract keyphrases according to their saliency. It contains three modules: syntactic accuracy chunking, information saliency ranking, and concept consistency matching. Chunking and matching are used to push the ranking module to rank the proper candidate keyphrases ahead. Therefore, only the ranking module is used in the inference process (test phase).
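Inference with the ranking module alone can be sketched as a sort over the saliency scores. The snippet below is a minimal illustration: keeping the highest score when the same surface form occurs at several positions is an assumption of this sketch, and the candidate phrases and scores are hypothetical.

```python
def extract_keyphrases(candidates, scores, k=5):
    """Rank candidates by saliency score and keep the k best distinct phrases."""
    best = {}
    for phrase, score in zip(candidates, scores):
        # The same phrase can occur at several positions; keep its top score (assumption).
        if phrase not in best or score > best[phrase]:
            best[phrase] = score
    ranked = sorted(best, key=lambda p: best[p], reverse=True)
    return ranked[:k]

cands = ["error bound", "harmonic balance method", "error bound", "this paper"]
scores = [1.2, 2.7, 1.9, -0.4]
top = extract_keyphrases(cands, scores, k=2)
# top is ["harmonic balance method", "error bound"]
```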

Dataset     Avg. Document Len.   Avg. # Keyphrase   Avg. Keyphrase Len.
OpenKP      900.4                1.8                2.0
KP20k       179.8                5.3                2.0
Inspec      128.7                9.8                2.5
Krapivin    182.6                5.8                2.2
Nus         219.1                11.7               2.2
SemEval     234.8                14.7               2.4
Table 2: Statistics of six benchmark datasets. Document Len. and Keyphrase Len. represent the number of words in the document and keyphrase respectively.

4 Experimental Settings

4.1 Datasets

Six benchmark datasets are used in our experiments: OpenKP Xiong et al. (2019), KP20k Meng et al. (2017), Inspec Hulth (2003), Krapivin Krapivin and Marchese (2009), Nus Nguyen and Kan (2007), and SemEval Kim et al. (2010). Table 2 summarizes the statistics of each testing set.

OpenKP consists of around 150K documents sampled from the index of the Bing search engine. In OpenKP, we follow the official split of training (134K documents), development (6.6K documents), and testing (6.6K documents) sets. The keyphrases for each document in OpenKP were labeled by expert annotators, with each document assigned 1-3 keyphrases. As a requirement, all the keyphrases appeared in the original document Xiong et al. (2019).

KP20k contains a large amount of high-quality scientific metadata in the computer science domain from various online digital libraries Meng et al. (2017). We follow the official setting of this dataset and split it into training (528K documents), validation (20K documents), and testing (20K documents) sets. From the training set of KP20k, we remove all articles that are duplicated within the training set itself or that also appear in the KP20k validation or testing sets. After the cleanup, the KP20k dataset contains 504K training samples, 20K validation samples, and 20K testing samples.

To verify the robustness of KIEMP, we also test the model trained with KP20k dataset on four widely-adopted keyphrase extraction data sets including Inspec, Krapivin, Nus, and SemEval.

In this paper, we focus on keyphrase extraction. Therefore, only the keyphrases that appear in the documents are used for training and evaluation.
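Restricting the targets to present keyphrases amounts to a contiguous-subsequence check against the document tokens. The sketch below is illustrative only: it uses whitespace tokenization and exact lowercase matching, whereas a real pipeline would typically match after stemming (for example, so that “singular point” matches “singular points” in Table 1). The function name is hypothetical.

```python
def present_keyphrases(doc_tokens, keyphrases):
    """Keep the keyphrases whose token sequence occurs contiguously in the document."""
    doc = [t.lower() for t in doc_tokens]
    kept = []
    for phrase in keyphrases:
        toks = phrase.lower().split()
        n = len(toks)
        # Compare the phrase against every window of n consecutive document tokens.
        if n and any(doc[i:i + n] == toks for i in range(len(doc) - n + 1)):
            kept.append(phrase)
    return kept

doc = "the algebraic representation of the error bound using grobner base".split()
gold = ["error bound", "grobner base", "quadratic approximation"]
present = present_keyphrases(doc, gold)
# present is ["error bound", "grobner base"]; "quadratic approximation" is absent
```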

Hyper-parameter                              Dimension or Value
$\lambda$                                    0.5
$\epsilon_{1},\epsilon_{2},\epsilon_{3}$     1/3
$\delta_{1},\delta_{2}$                      1.0
Optimizer                                    AdamW
Learning Rate                                $1\times 10^{-5}$
Batch Size                                   32
Warm-Up Proportion                           10%
RoBERTa Embedding ($\mathbb{R}^{d}$)         768
Concept Dimension ($\mathbb{R}^{c}$)         64
Max Sequence Length                          512
Maximum Phrase Length ($N$)                  5
Table 3: Parameters used for training KIEMP.
Model                                 OpenKP                                                KP20k
                                      R@1      R@3      R@5      F1@1     F1@3     F1@5     F1@5     F1@10
Unsupervised Methods
TFIDF Jones (2004)                    0.150    0.284    0.347    0.196*   0.223*   0.196*   0.105    0.130
TextRank Mihalcea and Tarau (2004)    0.041    0.098    0.142    0.054*   0.076*   0.079*   0.180    0.150
Supervised Methods with Additional Features
BLING-KPE Xiong et al. (2019)         0.220    0.390    0.481    0.285*   0.303*   0.270*   -        -
SMART-KPE+R2J Wang et al. (2020)      0.307    0.532    0.625    0.381    0.405    0.347    -        -
Supervised Methods without Additional Features
CopyRNN Meng et al. (2017)            0.174    0.331    0.413    0.217*   0.237*   0.210*   0.327    0.278
DivGraphPointer Sun et al. (2019)     -        -        -        -        -        -        0.368    0.292
Div-DGCN Zhang et al. (2020)          -        -        -        -        -        -        0.349    0.313
SKE-Large-CLS Mu et al. (2020)        -        -        -        -        -        -        0.392    0.330
ChunkKPE Sun et al. (2020)            0.283    0.486    0.581    0.355    0.373    0.324    0.408    0.337
RankKPE Sun et al. (2020)             0.290    0.509    0.604    0.361    0.390    0.337    0.417    0.343
JointKPE Sun et al. (2020)            0.291    0.511    0.605    0.364    0.391    0.338    0.419    0.344
KIEMP                                 0.298    0.517    0.615    0.369    0.392    0.340    0.421    0.345
Table 4: Performance of keyphrase extraction models on the OpenKP development set and the KP20k testing set. The best results of our model are highlighted in bold, and the best results of baselines are underlined. * indicates these numbers are not included in the original paper and are estimated with Precision and Recall. The results of the baselines are reported in their corresponding papers.

4.2 Baselines

This paper focuses on the comparisons with the state-of-the-art baselines and chooses the following keyphrase extraction models as our baselines.

TextRank An unsupervised algorithm based on weighted-graphs proposed by Mihalcea and Tarau (2004). Given a word graph built on co-occurrences, it calculates the importance of candidate words with PageRank. The importance of a candidate keyphrase is then estimated as the sum of the scores of the constituent words.

TFIDF Jones (2004) is computed based on the candidate frequency in the given text and the inverse document frequency.

CopyRNN Meng et al. (2017) uses the attention mechanism as a copy mechanism to extract keyphrases from the given document.

BLING-KPE Xiong et al. (2019) first concatenates the pre-trained language model (ELMo Peters et al. (2018)) as word embeddings, visual as well as positional features, and then uses a CNN network to obtain n-gram phrase embeddings for binary classification.

JointKPE Sun et al. (2020) jointly learns a chunking model (ChunkKPE) and a ranking model (RankKPE) for keyphrase extraction.

SMART-KPE+R2J Wang et al. (2020) presents a multi-modal method to the keyphrase extraction task, which leverages lexical and visual features to enable strategy induction as well as meta-level features to aid in strategy selection.

DivGraphPointer Sun et al. (2019) combines the advantages of traditional graph-based ranking methods and recent neural network-based approaches. Furthermore, they also propose a diversified point network to generate a set of diverse keyphrases out of the word graph in the decoding process.

Div-DGCN Zhang et al. (2020) proposes to adopt the Dynamic Graph Convolutional Networks (DGCN) to acquire informative latent document representation and better model the compositionality of the target keyphrases set.

SKE-Large-CLS Mu et al. (2020) obtains span-based representation for each keyphrase and further learns to capture the similarity between keyphrases in the source document to get better keyphrase predictions.

For ease of presentation, all the baselines are divided into three groups according to the perspectives they consider: syntax, saliency, and the combination of syntax and saliency. Among them, BLING-KPE, CopyRNN, and ChunkKPE belong to the first group; TFIDF, TextRank, and RankKPE belong to the second; and DivGraphPointer, Div-DGCN, SKE-Large-CLS, SMART-KPE+R2J, and JointKPE belong to the last.

4.3 Evaluation Metrics

For the keyphrase extraction task, the performance of a keyphrase model is typically evaluated by comparing the top $k$ predicted keyphrases with the target keyphrases (ground-truth labels). The evaluation cutoff $k$ can be a fixed number (e.g., $F_{1}@5$ compares the top-$5$ keyphrases predicted by the model with the ground truth to compute an $F_{1}$ score). Following previous work Meng et al. (2017); Sun et al. (2019), we adopt macro-averaged recall and F-measure ($F_{1}$) as evaluation metrics, and $k$ is set to 1, 3, 5, and 10. In the evaluation, we apply the Porter Stemmer Porter (2006) to both target keyphrases and extracted keyphrases when determining whether keyphrases and their constituent words match.
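The cutoff-based metrics can be sketched as follows. This is a minimal illustration rather than the evaluation script used for the reported numbers: `normalize` is a stand-in for Porter stemming (it only lowercases and collapses whitespace), and dividing precision by the number of returned predictions instead of by $k$ is an assumption of this sketch. The example phrases are taken from example (A) in Table 6.

```python
def normalize(phrase):
    # Stand-in for Porter stemming: lowercase and collapse whitespace only (assumption).
    return " ".join(phrase.lower().split())

def prf_at_k(predicted, gold, k):
    """Precision, recall and F1 of the top-k predictions against the gold set."""
    top = [normalize(p) for p in predicted[:k]]
    gold_set = {normalize(g) for g in gold}
    correct = len(set(top) & gold_set)
    precision = correct / len(top) if top else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

def macro_f1_at_k(all_predicted, all_gold, k):
    """Average the per-document F1@k over all documents."""
    scores = [prf_at_k(p, g, k)[2] for p, g in zip(all_predicted, all_gold)]
    return sum(scores) / len(scores)

pred = ["great plateau", "breath of the wild", "hyrule", "paraglider", "starting region"]
gold = ["great plateau", "breath of the wild", "hyrule"]
p5, r5, f5 = prf_at_k(pred, gold, 5)   # precision 3/5, recall 3/3, F1 0.75
p3, r3, f3 = prf_at_k(pred, gold, 3)   # all three top predictions are correct
```

The same harmonic-mean formula, $F_{1}=2PR/(P+R)$, is how an $F_{1}$ value can be recovered when only precision and recall are reported.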

4.4 Implementation Details

Implementation details of our proposed model are as follows. The maximum document length is 512 due to BERT limitations Devlin et al. (2019), and documents are zero-padded or truncated to this length. Training used 4 GeForce RTX 2080 Ti GPUs and took about 31 hours and 77 hours for the OpenKP and KP20k datasets respectively. Table 3 lists the parameters of our model. The model was implemented in PyTorch Paszke et al. (2019) using the Hugging Face re-implementation of RoBERTa Wolf et al. (2019).
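The length normalization mentioned above can be illustrated with a short helper. The function name, the right-side padding, and the padding id of 0 are assumptions of this sketch; an actual tokenizer defines its own padding token and usually also returns an attention mask.

```python
def pad_or_truncate(token_ids, max_len=512, pad_id=0):
    """Cut sequences longer than max_len and right-pad shorter ones with pad_id."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

long_doc = list(range(1, 601))             # 600 hypothetical token ids
short_doc = [5, 6]
fixed_long = pad_or_truncate(long_doc)     # truncated to 512 ids
fixed_short = pad_or_truncate(short_doc, max_len=4)   # [5, 6, 0, 0]
```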

5 Results and Analysis

This section investigates the performance of the proposed KIEMP on six widely-used benchmark datasets (OpenKP, KP20k, Inspec, Krapivin, Nus, and SemEval) from three facets. The first demonstrates its superiority by comparing it with ten baselines in terms of several metrics. The second verifies the sensitivity to the concept dimension. The last explicitly shows the keyphrase extraction results of KIEMP via two examples (two testing documents).

5.1 Overall Performance

The overall performance of different algorithms on two benchmarks (OpenKP and KP20k) is summarized in Table 4. We can see that the supervised methods outperform all the unsupervised algorithms (TFIDF and TextRank). This is not surprising since the supervised methods are trained end-to-end with supervised data. Among the supervised baselines, the methods using additional features are better than those without additional features. The reason is that models with additional features effectively encode keyphrases from the perspectives of multiple features, which helps the model measure the importance of each keyphrase and thus improves keyphrase extraction. Intuitively, this is similar to our proposed method: KIEMP considers the importance of keyphrases from multiple perspectives and measures the importance of each keyphrase fairly. The difference is that we do not need additional features, and in many practical applications of keyphrase extraction no additional features (e.g., visual features) are available. Compared with recent baselines (ChunkKPE, RankKPE, and JointKPE), KIEMP performs consistently better on all metrics on both datasets. These results demonstrate the benefits of estimating the importance of keyphrases from multiple perspectives simultaneously and the effectiveness of our multi-task learning strategy.

Figure 2: Results of keyphrase extraction models on four testing sets (SemEval, Inspec, Krapivin, and Nus). The results of JointKPE are re-evaluated with the code provided by its corresponding paper.

Furthermore, to verify the robustness of KIEMP, we also test the KIEMP model trained on the KP20k dataset on four widely-adopted keyphrase extraction datasets. It can be seen from Figure 2 that KIEMP is superior to the best baseline (JointKPE). We attribute this to two benefits. One is that the high-level concepts captured by a deep latent variable model may contain topic and structure features, which are essential information for evaluating the importance of phrases. The other is that the latent variable is characterized by a probability distribution over possible values rather than a fixed value, which lets the model account for uncertainty and further leads to robust representation learning.

Concept Dimension ($\mathbb{R}^{c}$)    OpenKP
                                        R@1      R@3      R@5
64                                      0.298    0.517    0.615
256                                     0.297    0.513    0.610
512                                     0.296    0.509    0.609
768                                     0.293    0.508    0.606
Table 5: Effectiveness of different dimensions of latent concept representation. The best results are highlighted in bold.
(A) Part of the Input Document:
The Great Plateau is a large region of land that is secluded from other parts of Hyrule, as its steep slopes prevent anyone from traveling to and from it without special equipment, such as the Paraglider. The only active inhabitant is the Old Man, a mysterious … (URL: https://zelda.gamepedia.com/Great_Plateau)
Target Keyphrase: (1) great plateau ; (2) breath of the wild ; (3) hyrule
KIEMP without concept consistency matching: (1) great plateau ; (2) hyrule ; (3) breath of the wild ; (4) paraglider ; (5) zelda
KIEMP: (1) great plateau ; (2) breath of the wild ; (3) hyrule ; (4) paraglider ; (5) starting region
(B) Part of the Input Document:
Transformational leaders also depend on visionary leadership to win over followers, but they have an added focus on employee development. For example, a transformational leader might explain how her plan for the future serves her employees’ interests and how she will support them through the changes … (URL: https://yourbusiness.azcentral.com/managers-different-leadership-styles-motivate-teams-8481.html)
Target Keyphrase: (1) managers ; (2) leadership ; (3) teams
KIEMP without concept consistency matching: (1) motivating ; (2) motivate ; (3) charismatic leadership ; (4) transformational leadership ; (5) employee development
KIEMP: (1) leadership styles; (2) managers ; (3) charismatic leadership ; (4) transformational leadership ; (5) leadership
Table 6: Example of keyphrase extraction results (selected from the OpenKP dataset). Phrases in red and bold are target keyphrases predicted by the different models (KIEMP without concept consistency matching and KIEMP).

5.2 Sensitivity of the Concept Dimension

Here, we verify the effectiveness of using different concept dimensions. From Table 5, we find that increasing the dimension of the latent concept representation has little effect on the result of keyphrase extraction; if anything, the smaller the dimension, the better the result. Furthermore, in Table 4, the improvement of our proposed KIEMP model on the $F_{1}@1$ evaluation metric is higher than on the $F_{1}@3$ and $F_{1}@5$ evaluation metrics on the OpenKP dataset. We consider the main reason to be that our concept representation may capture the high-level conceptual information of phrases or documents, such as topic and structure information. Therefore, KIEMP with the concept consistency matching module focuses more on extracting the keyphrases closest to the main topic of the given document.

5.3 Case Study

To further illustrate the effectiveness of the proposed model, we present a case study on the keyphrases extracted by different algorithms. Table 6 presents the results of KIEMP without concept consistency matching and of the full KIEMP. From the first example, we can see that our KIEMP model is more inclined to extract keyphrases closer to the central semantics of the input document, which benefits from our concept consistency matching module. From the second example, we can see that the keyphrases extracted by KIEMP without concept consistency matching contain some redundant or meaningless phrases. The main reason may be that KIEMP without concept consistency matching does not measure the importance of phrases from all three perspectives, which leads to biased extraction. In contrast, the keyphrases extracted by KIEMP all revolve around the main concept of the example document, i.e., “leadership”. This further demonstrates the effectiveness of our proposed model.

6 Conclusions and Future Work

We propose a new approach that estimates the importance of keyphrases from multiple perspectives. Benefiting from the designed syntactic accuracy chunking, information saliency ranking, and concept consistency matching modules, KIEMP can extract keyphrases fairly. A series of experiments demonstrates that KIEMP outperforms the existing state-of-the-art keyphrase extraction methods in most cases. In the future, it will be interesting to introduce an adaptive approach in KIEMP to filter out meaningless phrases.

7 Acknowledgments

This work was supported in part by the National Key Research and Development Program of China under Grant 2020AAA0106800; the National Science Foundation of China under Grants 61822601 and 61773050; the Beijing Natural Science Foundation under Grant Z180006; and the Fundamental Research Funds for the Central Universities (2019JBZ110).

References