Abstractive Opinion Tagging

Qintong Li (Shandong University), Piji Li (Tencent AI Lab), Xinyi Li (Peng Cheng Laboratory), Zhaochun Ren (Shandong University), Zhumin Chen (Shandong University), and Maarten de Rijke (University of Amsterdam & Ahold Delhaize; ORCID 0000-0002-1086-0202)
(2021)
Abstract.

In e-commerce, opinion tags refer to a ranked list of tags provided by the e-commerce platform that reflect characteristics of reviews of an item. To help consumers quickly grasp a large number of reviews about an item, opinion tags are increasingly being applied by e-commerce platforms. Current mechanisms for generating opinion tags rely on either manual labelling or heuristic methods, which are time-consuming and ineffective. In this paper, we propose the abstractive opinion tagging task, where systems have to automatically generate a ranked list of opinion tags that are based on, but need not occur in, a given set of user-generated reviews.

The abstractive opinion tagging task comes with three main challenges: (1) the noisy nature of reviews; (2) the formal nature of opinion tags vs. the colloquial language usage in reviews; and (3) the need to distinguish between different items with very similar aspects. To address these challenges, we propose an abstractive opinion tagging framework, named AOT-Net, to generate a ranked list of opinion tags given a large number of reviews. First, a sentence-level salience estimation component estimates each review’s salience score. Next, a review clustering and ranking component ranks reviews in two steps: first, reviews are grouped into clusters and ranked by cluster size; then, reviews within each cluster are ranked by their distance to the cluster center. Finally, given the ranked reviews, a rank-aware opinion tagging component incorporates an alignment feature and alignment loss to generate a ranked list of opinion tags. To facilitate the study of this task, we create and release a large-scale dataset, called eComTag, crawled from real-world e-commerce websites. Extensive experiments conducted on the eComTag dataset verify the effectiveness of the proposed AOT-Net in terms of various evaluation metrics.

Review analysis; abstractive summarization; e-commerce
Journal year: 2021. Copyright: ACM licensed. Conference: Proceedings of the Fourteenth ACM International Conference on Web Search and Data Mining (WSDM ’21), March 8–12, 2021, Virtual Event, Israel. Price: 15.00. DOI: 10.1145/3437963.3441804. ISBN: 978-1-4503-8297-7/21/03. CCS concepts: Information systems → Summarization; Information systems → Sentiment analysis.

1. Introduction

Reviews of a hot-pot restaurant
$U_{1}$: The waitress was extremely attentive and even gave us a free fried man tou dessert that came with condensed milk for dipping…I love it!! $U_{2}$: I was pleasantly surprised about how yummy the dish and the lamb were… $U_{3}$: All in all; was a great experience and the service is really above and beyond. $U_{4}$: The restaurant guest is more, can be served quickly, our table was quickly dish bowl filled with. Overall cool experience. $U_{5}$: The shrimp was fresh and the pork mixture was tasty. $U_{6}$: Fairly quick and polite service. It worth that price! … excellent, relaxed and cozy atmosphere, and what can I say, satisfying. $U_{N}$: Food is delicious, reasonably priced…Go here! you deserve it!
Opinion tags
hospitable service (223), delicious food (165), value for money (104), comfortable environment (65), served quickly (14).
Figure 1. An example of a set of reviews and their corresponding opinion tags, where the subscript of $U_{N}$ denotes the index of a review, and the number after each opinion tag is the number of reviews belonging to that tag.

With the explosive growth of customer reviews in e-commerce scenarios, many online platforms, such as Amazon (https://www.amazon.com/) and Alibaba (https://www.alibaba.com/), provide opinion tags to enable potential buyers to make informed decisions without having to absorb large numbers of reviews. As shown in Figure 1, a sequence of opinion tags is a ranked list mined from a set of reviews that reflects different users’ preferences towards certain aspects of items. Many studies have focused on mining valuable information from reviews and shown promising results in various tasks, such as opinion summarization (Amplayo and Lapata, 2019; Li et al., 2020; Brazinskas et al., 2020; Amplayo and Lapata, 2020; Suhara et al., 2020; Li et al., 2017, 2019a; Carmeli et al., 2020) and item description generation (Novgorodov et al., 2019; Elad et al., 2019). So far, however, no study seems to have developed opinion tagging methods to generate opinion tags that reflect diverse opinions of item aspects in a concise manner. In this paper, we propose the task of abstractive opinion tagging, which aims to automatically generate a ranked list of opinion tags that stem from, but need not occur in, a given set of user-generated reviews. This is an abstractive rather than an extractive opinion tagging task as the opinion tags do not need to occur in the review.

To solve this new task, we face three challenges. First, in reviews of e-commerce items, the noisy nature of the reviews inevitably makes it hard to identify salient item-related features (Gao et al., 2019; Li et al., 2020). According to human annotations (on the eComTag dataset described below), we find that almost 59% of review sentences are not item-related or do not contain opinions towards certain aspects. Second, different reviewers have different ways of expressing themselves (Tang et al., 2015), whereas the target opinion tags are usually in a more formal style. The colloquial language usage of user reviews (Wang et al., 2019a) makes abstractive methods necessary to learn better review representations. Third, it is difficult to reflect the different perspectives on an item in an accurate and diverse manner. Many e-commerce platforms rank opinion tags to help customers distinguish between different items with very similar aspects. In Figure 2, we plot the number of relevant reviews on different ranked indexes of opinion tags in the eComTag dataset (described below), which suggests a natural ranking for the tags when presented to users.

Figure 2. Fraction of relevant reviews per opinion tag in the eComTag dataset.

To address the challenges listed above, we design an abstractive framework, named AOT-Net, which consists of three components: (1) a sentence-level salience estimation component that predicts a salience score for each review; (2) a review clustering and ranking component that first groups reviews into clusters, which we refer to as “opinion clusters,” and ranks reviews by opinion cluster size; this component then ranks reviews within each cluster by their distance to the cluster center; and (3) a rank-aware opinion tagging component that generates opinion tags with ranks.

AOT-Net works in such a way that the ranks of the opinion clusters are correlated with the ranks of opinion tags. To see why this is potentially useful, we consider the eComTag dataset (crawled from e-commerce sites and consisting of items, reviews, and opinion tags), group reviews into opinion clusters and rank clusters by cluster size. We then compute the semantic similarity between the opinion clusters and the opinion tags, obtained by averaging the semantic similarity between an opinion tag and reviews in an opinion cluster. Figure 3 visualizes the average semantic similarity between 10 opinion tags and 10 opinion clusters for all item samples in the eComTag dataset. Clearly, the opinion tags have broader chunks with the opinion clusters at the same rank, revealing an alignment between the ranked opinion tags and ranked opinion clusters. Furthermore, the opinion tags are semantically similar to the neighbors of the corresponding opinion clusters, which is reflected by the relatively wide color bands, e.g., the second opinion tag is semantically similar to the first opinion cluster and the third opinion cluster. The rank-aware opinion tagging component of AOT-Net integrates two alignment strategies: (1) an opinion tag and its corresponding opinion clusters are placed at similar ranks; and (2) an alignment loss explicitly encourages the model to focus on the aligned opinion clusters and ignore others.
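The tag-to-cluster similarity analysis described above can be sketched as follows. This is a minimal illustration rather than the procedure used to produce Figure 3: the text does not state which similarity measure is used, so cosine similarity over embedding vectors is an assumption here, and the names `cosine` and `tag_cluster_similarity` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def tag_cluster_similarity(tag_vecs, clusters):
    """S[j][k] = mean similarity between tag j and the reviews in cluster k.

    tag_vecs: one embedding per opinion tag, in tag-rank order.
    clusters: one list of review embeddings per opinion cluster,
              in cluster-rank order (largest cluster first).
    """
    return [
        [sum(cosine(t, r) for r in cluster) / len(cluster) for cluster in clusters]
        for t in tag_vecs
    ]
```

If tag ranks and cluster ranks are aligned, the largest entries of each row of `S` should lie on or near the diagonal.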

To validate AOT-Net, we collect a new dataset, named eComTag, from Chinese e-commerce websites, containing reviews and opinion tags for 50,068 items. Our experiments based on the eComTag dataset show that AOT-Net is capable of significantly improving the generation performance on abstractive opinion tagging over state-of-the-art baselines.

Figure 3. The average semantic similarity between opinion tags and opinion clusters for all samples in the eComTag dataset. Each color denotes an opinion cluster. The same opinion clusters share the same color across different opinion tags. The width of the color band denotes the degree of semantic similarity between an opinion tag and the corresponding opinion cluster. (Best viewed in color.)

Our contributions can be summarized as follows:

  • We propose a new task, abstractive opinion tagging, to generate opinion tags based on large volumes of item reviews.

  • We propose an abstractive framework AOT-Net to generate opinion tags based on reviews for a given item. AOT-Net has a sentence-level salience estimation component and a review clustering and ranking component to highlight salient reviews and rank reviews by clustering. We propose a rank-aware opinion tagging component with two alignment strategies to generate ranked opinion tags.

  • We collect a large-scale dataset, namely eComTag, consisting of item reviews and opinion tags, to support research into abstractive opinion tagging. Experimental results conducted on the eComTag dataset demonstrate the effectiveness of the AOT-Net framework.

2. Related Work

Related work comes in two categories: keyphrase generation and opinion summarization.

2.1. Keyphrase Generation

A lot of research has been conducted on generating keyphrases to summarize various types of text such as tweets, news reports, research articles, etc. (Meng et al., 2017; Chen et al., 2019; Liu et al., 2020b; Zhang et al., 2018; Wang et al., 2019b, a; Ray Chowdhury et al., 2019; Liu et al., 2020a). Early approaches to keyphrase generation extract important phrases from the document as the results. Sequence tagging models have been applied to identify keyphrases (Zhang et al., 2016; Luan et al., 2017; Gollapalli et al., 2017). Retrieval-based approaches utilize a two-step pipeline to extract and rank candidate keyphrases (Mihalcea and Tarau, 2004; Medelyan et al., 2009; Wang et al., 2016; Le et al., 2016). Sun et al. (2019) adopt an extractive graph-based approach, which applies a pointer network to generate a set of diverse keyphrases. Recently, abstractive approaches have also been explored. Meng et al. (2017) are the first to employ an attention-based seq2seq framework with a copy mechanism to conduct abstractive keyphrase generation. Chan et al. (2019) propose a reinforcement learning approach for neural keyphrase generation that encourages a model to generate both sufficient and accurate keyphrases. Wang et al. (2019a) propose a topic-aware neural keyphrase generation method to identify topic words.

Unlike the work listed above, which only considers keyphrase generation for a single document, we consider opinion tagging from multiple documents, that is, from all of the reviews for a given item.

2.2. Opinion Summarization

Opinion summarization has become an emerging research topic in recent years. Early studies on opinion summarization focus on extracting salient sentences from the original review text (Hu and Liu, 2004; Carenini et al., 2006; Lu et al., 2009; Ganesan et al., 2010; Xiong and Litman, 2014; Angelidis and Lapata, 2018): Hu and Liu (2004) identify item features mentioned in the reviews and then extract opinion sentences for the identified features. Xiong and Litman (2014) utilize unsupervised learning methods to extract review summaries by exploiting review helpfulness ratings. Angelidis and Lapata (2018) present a weakly supervised neural framework for aspect-based opinion summarization by combining the tasks of aspect extraction and sentiment prediction. Many recent studies have shown that abstractive approaches, which can reflect the most representative opinions from reviewers, are more appropriate for summarizing review text (Carenini et al., 2013; Gerani et al., 2014; Fabbrizio et al., 2014; Li et al., 2019b; Tay, 2019): Gerani et al. (2014) utilize a template filling strategy to indirectly generate a review summary; Wang and Ling (2016) apply an attention-based encoder-decoder framework to generate an abstractive summary for opinionated documents. The main objective of the above summarization approaches is to generate coherent sentences to summarize opinions.

In contrast, we propose the abstractive opinion tagging task so as to generate opinion tags from a large number of user-generated reviews. In our scenario, opinion tags are more concise but without loss of essential information; they should help users comprehend reviews quickly and conveniently (Li et al., 2017).

3. Problem Formulation

Before detailing our proposed method, AOT-Net, we first formulate the abstractive opinion tagging problem. We use bold lowercase characters to denote vectors, and bold uppercase characters to denote matrices. We write $\mathbf{W}$ and $\mathbf{b}$ for a projection matrix and a bias vector in a neural network layer, respectively. Suppose that there are $M$ reviews for a given item. We denote each review $X_{i}$, $1\leq i\leq M$, as a sequence of words, i.e., $X_{i}=[x_{1},\ldots,x_{L_{x_{i}}}]$, where $L_{x_{i}}$ denotes the number of words in $X_{i}$. In the same way, we assume that $N$ opinion tags exist for a given item. We denote each opinion tag $Y_{j}$, $1\leq j\leq N$, as a sequence of words, i.e., $Y_{j}=[y_{1},\ldots,y_{L_{y_{j}}}]$, where $L_{y_{j}}$ refers to the number of words in $Y_{j}$. Given a set of reviews $\mathcal{X}=\left\{X_{1},X_{2},\ldots,X_{M}\right\}$, the task of abstractive opinion tagging is to generate a sequence of opinion tags $\mathcal{Y}=[Y_{1},Y_{2},\ldots,Y_{N}]$.

4. Method

Figure 4. Our proposed framework AOT-Net for abstractive opinion tagging.

4.1. Overview

Before providing the details of AOT-Net, our proposed method for abstractive opinion tagging, we first provide an overview in Figure 4. We divide AOT-Net into three main phases: (A) sentence-level salience estimation; (B) review clustering and ranking; and (C) rank-aware opinion tagging. For a set of reviews $\mathcal{X}$ about a given item, in phase A, we derive a salience score $z_{i}$ for each review $X_{i}\in\mathcal{X}$ to estimate its item-aware salience information. In phase B, reviews are first encoded into vector representations and weighted by corresponding salience scores. The weighted vector representations of reviews are clustered into $K$ opinion clusters $\{C_{1},\ldots,C_{K}\}$ and ranked by cluster size. Reviews within each cluster are ranked by their distance to the cluster center. Then we flatten ranked reviews into word-level vector representations. In phase C, we use review representations to generate ranked opinion tags via two alignment constraints, i.e., alignment features and alignment loss. We jointly learn all components in a multi-task learning framework.

4.2. A: Sentence-level Salience Estimation

The aim of the sentence-level salience estimation component is to compute a salience score for each review $X_{i}\in\mathcal{X}$. We design a sentence-level self-attention mechanism to highlight item-related reviews and reduce noise. First, the component reads each review sequence $X_{i}=[x_{1},\ldots,x_{L_{x_{i}}}]$ and uses a lookup table to convert each review word $x_{p}$ to a word embedding vector $\mathbf{x}_{p}\in\mathbb{R}^{d_{e}}$. To incorporate the contextual information of the review text into the representation of each word, we feed each embedding vector $\mathbf{x}_{p}$ to a bi-directional Gated Recurrent Unit (GRU) (Cho et al., 2014) to learn a hidden representation $\mathbf{h}_{p}\in\mathbb{R}^{d}$. More specifically, a bi-directional GRU consists of a forward GRU that reads the embedding sequence from $\mathbf{x}_{1}$ to $\mathbf{x}_{L_{x_{i}}}$ and a backward GRU that reads from $\mathbf{x}_{L_{x_{i}}}$ to $\mathbf{x}_{1}$:

(1) $\overrightarrow{\mathbf{h}}_{p} = \text{GRU}_{f}(\mathbf{x}_{p}, \overrightarrow{\mathbf{h}}_{p-1}),$
(2) $\overleftarrow{\mathbf{h}}_{p} = \text{GRU}_{b}(\mathbf{x}_{p}, \overleftarrow{\mathbf{h}}_{p+1}),$

where $\overrightarrow{\mathbf{h}}_{p}\in\mathbb{R}^{d/2}$ and $\overleftarrow{\mathbf{h}}_{p}\in\mathbb{R}^{d/2}$ denote the hidden states of the forward $\text{GRU}_{f}$ and backward $\text{GRU}_{b}$, respectively. We concatenate the last forward hidden state $\overrightarrow{\mathbf{h}}_{L_{x_{i}}}$ and the last backward hidden state $\overleftarrow{\mathbf{h}}_{1}$ to form the hidden representation for review $X_{i}$, i.e., $\mathbf{h}_{X_{i}}=[\overrightarrow{\mathbf{h}}_{L_{x_{i}}};\overleftarrow{\mathbf{h}}_{1}]$.

Next, we pass the hidden representations of all reviews $\{\mathbf{h}_{X_{i}}\}_{i=1:M}$ to a self-attention layer to model more complex interactions among the reviews. We propose a salience context vector $\mathbf{c}_{i}$ for each review $X_{i}$ to denote the shared information from other reviews:

(3) $\mathbf{q}_{i} = \mathbf{W}_{q}\mathbf{h}_{X_{i}}, \quad \mathbf{k}_{i} = \mathbf{W}_{k}\mathbf{h}_{X_{i}}, \quad \mathbf{v}_{i} = \mathbf{W}_{v}\mathbf{h}_{X_{i}},$
(4) $\mathbf{c}_{i} = \sum_{i^{\prime}=1}^{M}\frac{\exp(\mathbf{q}_{i}^{T}\mathbf{k}_{i^{\prime}})}{\sum_{o=1}^{M}\exp(\mathbf{q}_{i}^{T}\mathbf{k}_{o})}\mathbf{v}_{i^{\prime}},$

where $\mathbf{q}_{i},\mathbf{k}_{i},\mathbf{v}_{i}\in\mathbb{R}^{d}$ refer to query, key, and value vectors, respectively. These vectors are linearly transformed from the review hidden representation $\mathbf{h}_{X_{i}}$ by the projection matrices $\mathbf{W}_{q},\mathbf{W}_{k},\mathbf{W}_{v}\in\mathbb{R}^{d\times d}$. Then we apply a residual connection from the review hidden representation $\mathbf{h}_{X_{i}}$ to the salience context vector $\mathbf{c}_{i}$ and feed it to a two-layer feed-forward network with a ReLU as the activation function:

(5) $\mathbf{h}^{\prime}_{X_{i}} = \mathbf{W}_{s1}(\text{ReLU}(\mathbf{W}_{s2}(\mathbf{h}_{X_{i}}+\mathbf{c}_{i}))).$

Given the context-enhanced review hidden representation $\mathbf{h}^{\prime}_{X_{i}}$, we can derive the real-valued salience score $z_{i}$ for $X_{i}$:

(6) $z_{i} = \sigma(\mathbf{W}_{s}\mathbf{h}^{\prime}_{X_{i}} + b_{s}),$

where $\mathbf{W}_{s}\in\mathbb{R}^{d}$ and $b_{s}\in\mathbb{R}$; $\sigma(\cdot)$ is the sigmoid activation function. The salience scores $\{z_{1},\ldots,z_{M}\}$ serve as the salience weights of review representations for the later review clustering and ranking component.
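To make the data flow of Eqs. (3) to (6) concrete, the following is a minimal sketch in plain Python. It assumes the review representations $\mathbf{h}_{X_{i}}$ have already been produced by the bi-directional GRU and are passed in as lists of floats; all function and parameter names are hypothetical, the weights would be learned in practice, and the feed-forward layers are bias-free as in Eq. (5).

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def salience_scores(H, Wq, Wk, Wv, Ws2, Ws1, ws, bs):
    """Return one salience score z_i in (0, 1) per review representation in H."""
    Q = [matvec(Wq, h) for h in H]  # Eq. (3)
    K = [matvec(Wk, h) for h in H]
    V = [matvec(Wv, h) for h in H]
    scores = []
    for h, q in zip(H, Q):
        attn = softmax([dot(q, k) for k in K])  # attention weights of Eq. (4)
        c = [sum(a * v[t] for a, v in zip(attn, V)) for t in range(len(h))]
        residual = [hi + ci for hi, ci in zip(h, c)]
        hidden = [max(0.0, x) for x in matvec(Ws2, residual)]  # ReLU
        h_prime = matvec(Ws1, hidden)  # Eq. (5)
        scores.append(1.0 / (1.0 + math.exp(-(dot(ws, h_prime) + bs))))  # Eq. (6)
    return scores
```

Because every review attends to all reviews of the same item, a review that shares content with many others receives a context vector close to the dominant topics, which is what lets the score separate item-related reviews from noise.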

In order to optimize the sentence-level salience estimation component, we manually assign a binary salience label $z^{*}_{i}\in\{0,1\}$ to each review $X_{i}$, where 1 denotes “item-related” and 0 denotes “noisy”. Then, the sentence-level salience estimation component is trained by minimizing the cross-entropy loss function:

(7) $\mathcal{L}_{cla} = -\frac{1}{M}\sum_{i=1}^{M}\left[z^{*}_{i}\log(z_{i}) + (1-z^{*}_{i})\log(1-z_{i})\right].$
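As a small worked illustration of Eq. (7), the loss can be computed as below. The function name `salience_loss` and the clipping constant `eps` are assumptions added for numerical safety; they are not part of the formulation above.

```python
import math

def salience_loss(z_true, z_pred, eps=1e-12):
    """Mean binary cross-entropy between labels z*_i and predicted scores z_i."""
    total = 0.0
    for t, p in zip(z_true, z_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0); an added safeguard
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / len(z_true)
```

For example, predicting 0.5 for every review yields a loss of $\log 2$ regardless of the labels, and the loss decreases as the scores move towards the correct labels.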

4.3. B: Review Clustering and Ranking

We propose a review clustering and ranking component to learn the ranks of reviews by grouping reviews into ranked opinion clusters, which is the main prerequisite to accurately generate ranked opinion tags.

We use a standard transformer encoder (Vaswani et al., 2017) to convert each review $X_{i}$ into vector representations. Following Vaswani et al. (2017), we first map each word $x\in X_{i}$ into its vectorized representation $\mathbf{x}\in\mathbb{R}^{d_{e}}$ using a word embedding layer and a positional embedding layer, as shown in the following equation:

(8) $\mathbf{x} = \text{Embed}(x) + \text{Pos}(x).$

Then we use a transformer layer to encode global contextual information for words within $X_{i}$:

(9) $\mathbf{g} = \text{LayerNorm}(\mathbf{x}^{n-1} + \text{MHAtt}(\mathbf{x}^{n-1})),$
(10) $\mathbf{x}^{n} = \text{LayerNorm}(\mathbf{g} + \text{FFN}(\mathbf{g})),$

where LayerNorm is the layer normalization proposed by Ba et al. (2016); MHAtt is the multi-head attention mechanism introduced by Vaswani et al. (2017); FFN is a two-layer feed-forward network with ReLU as the hidden activation function; and $n$ is the number of transformer block layers. The word-level vector representations of review $X_{i}$ are $\mathbf{X}_{i}=[\mathbf{x}_{1},\ldots,\mathbf{x}_{L_{x_{i}}}]=[\mathbf{x}^{n}_{1},\ldots,\mathbf{x}^{n}_{L_{x_{i}}}]$. To obtain the sentence representation $\tilde{\mathbf{X}}_{i}$ of review $X_{i}$, we perform a hierarchical pooling operation (Shen et al., 2018) across its different words. The hierarchical pooling mechanism can preserve word order information and has demonstrated superior performance over mean-pooling or max-pooling on many semantic analysis tasks (Shen et al., 2018).

To highlight the item-aware reviews and ignore noisy reviews, the sentence representations of reviews are first weighted by the corresponding salience scores, i.e., $\tilde{\mathbf{X}}^{\prime}_{i}=z_{i}\tilde{\mathbf{X}}_{i}$. Then we apply the $k$-means (MacQueen, 1967) algorithm on $\{\tilde{\mathbf{X}}^{\prime}_{1},\ldots,\tilde{\mathbf{X}}^{\prime}_{M}\}$ to group the corresponding $\{\mathbf{X}_{1},\ldots,\mathbf{X}_{M}\}$ into $K$ opinion clusters. ($K$ is manually assigned according to the number of reviews: if $M\leq 200$, $K=\lceil\frac{M}{20}\rceil$; otherwise, $K=20$.) We rank opinion clusters from the largest (containing the highest number of reviews) to the smallest, denoted as $[C_{1},\ldots,C_{K}]$. Within each cluster, we rank reviews from the nearest to the farthest, measured by the distance between the review and the cluster center. Finally, we obtain a ranked list of reviews, represented as $[\mathbf{X}_{1,1},\ldots,\mathbf{X}_{1,L_{1}},\ldots,\mathbf{X}_{K,1},\ldots,\mathbf{X}_{K,L_{K}}]$, where $\mathbf{X}_{k,i}$ is the $i$-th review in the $k$-th opinion cluster and $L_{k}$ is the number of reviews in the $k$-th opinion cluster.
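The clustering and ranking procedure above can be sketched as follows, using only the Python standard library. This is an illustrative sketch under stated assumptions: the $k$-means initialization (random sampling with a fixed seed), the number of iterations, and the use of squared Euclidean distance are choices made here, and the names `num_clusters`, `kmeans`, and `rank_reviews` are hypothetical. The rule for $K$ follows the footnote in the text.

```python
import math
import random

def num_clusters(m):
    """Rule from the text: K = ceil(M / 20) if M <= 200, otherwise K = 20."""
    return math.ceil(m / 20) if m <= 200 else 20

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd iterations; returns (assignment per point, cluster centers)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers

def rank_reviews(sent_vecs, salience, k=None):
    """Return review indices grouped by cluster: largest cluster first,
    and within each cluster nearest-to-center first."""
    weighted = [[z * x for x in vec] for vec, z in zip(sent_vecs, salience)]
    k = k or num_clusters(len(weighted))
    assign, centers = kmeans(weighted, k)
    clusters = {}
    for idx, c in enumerate(assign):
        clusters.setdefault(c, []).append(idx)
    ordered = sorted(clusters.items(), key=lambda kv: -len(kv[1]))
    return [
        sorted(members, key=lambda i: sq_dist(weighted[i], centers[c]))
        for c, members in ordered
    ]
```

Concatenating the inner lists of the result in order gives the ranked review list; a production system would more likely rely on an established $k$-means implementation than on this hand-rolled loop.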

We sequentially concatenate the vector representations of reviews in the ranked list and derive the final word-level representations, i.e., $\mathbf{X}=[\mathbf{x}_{1,1},\ldots,\mathbf{x}_{1,c_{1}},\ldots,\mathbf{x}_{K,1},\ldots,\mathbf{x}_{K,c_{K}}]$. Similarly, $x_{k,p}$ is the $p$-th word in the $k$-th opinion cluster and $c_{k}$ is the number of words in the $k$-th opinion cluster. (In this paper, we only focus on the ranks of opinion clusters. We regard reviews or words in the same opinion cluster as equally important.) Next, $\mathbf{X}$ will serve as the memory bank for the later rank-aware opinion tagging component.

4.4. C: Rank-aware Opinion Tagging

Accurately generating opinion tags with ranks is challenging. Therefore, we propose a rank-aware opinion tagging component to generate ranked opinion tags.

In the training stage, we add a start token BOS at the beginning of each opinion tag, i.e., $Y^{\prime}_{j}=[\texttt{BOS};Y_{j}]$, where $[;]$ is the concatenation function. Then we concatenate all opinion tags into a sequence of words: $Y=[Y^{\prime}_{1};\ldots;Y^{\prime}_{N}]=[y_{1,0},\ldots,y_{1,L_{y_{1}}},\ldots,y_{N,0},\ldots,y_{N,L_{y_{N}}}]$, where $y_{j,q}$ is the $q$-th word of opinion tag $Y^{\prime}_{j}$ and $y_{j,0}$ is the BOS token of the $j$-th opinion tag. Our decoder, i.e., the rank-aware opinion tagging component, follows the transformer architecture (Vaswani et al., 2017).

From the data analysis in Figure 3, we know that the ranks of opinion tags have a strong correlation with the ranks of opinion clusters. The $j$-th opinion tag may pay attention to the $j$-th opinion cluster and its surrounding neighbors simultaneously. Therefore, for each opinion tag $Y^{\prime}_{j}$, we hypothesize that the model needs to focus on the $F$ most related opinion clusters. ($F$ is a hyperparameter; the $F$ opinion clusters are the $j$-th opinion cluster and its surrounding neighbors.) If an opinion cluster belongs to the $F$ focused opinion clusters, we call it a Focused Opinion Cluster (FOC). Otherwise, we call it an Outer Opinion Cluster (OOC). We design two alignment strategies between FOCs and opinion tags, i.e., incorporating alignment features and enforcing an alignment loss, which help to improve the generation of ranked opinion tags.
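One possible way to pick the FOCs for the $j$-th tag is sketched below. The text only states that the $F$ clusters consist of the $j$-th cluster and its surrounding neighbors; the symmetric window centered on $j$ and the clipping at the first and last cluster are assumptions of this sketch, and `focused_clusters` is a hypothetical name.

```python
def focused_clusters(j, num_clusters, f):
    """Return the 1-based ranks of the (assumed) F focused opinion clusters for tag j.

    The window is centered on rank j and shifted inward when it would
    extend beyond rank 1 or beyond the last cluster.
    """
    f = min(f, num_clusters)
    start = max(1, min(j - f // 2, num_clusters - f + 1))
    return list(range(start, start + f))
```

Under this assumption, with $F=3$ the second tag focuses on clusters 1 to 3, which matches the observation that the second opinion tag is similar to the first and third opinion clusters. All remaining clusters are treated as OOCs.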

Alignment Feature

Intuitively, the opinion tags and their FOCs are semantically similar in the vector space. To help the model capture the alignment between opinion tags and their FOCs, we incorporate alignment features $\text{Aln}(\cdot)$ into the word-level representations of the opinion tags and opinion clusters. Formally, $\text{Aln}(\cdot)$ is a function that maps an integer into a vector. Now, we explain how we use it to represent the ranks of opinion tags and opinion clusters. First, we incorporate the alignment feature into the representations of opinion tags. The words in the $j$-th opinion tag have the same rank $j$, where $1\leq j\leq N$. For each word $y_{j,q}$, the vectorized representation is the sum of the alignment feature $\text{Aln}(j)$, the word embedding $\text{Embed}(y_{j,q})$, and the positional embedding $\text{Pos}(y_{j,q})$:

(11) $\mathbf{y}_{j,q} = \mathbf{W}_{rt}\text{Aln}(j) + \text{Embed}(y_{j,q}) + \text{Pos}(y_{j,q}),$

where $\mathbf{W}_{rt}\in\mathbb{R}^{d_{e}\times d_{e}}$ is a trainable model parameter.

Figure 5. Representations attached with alignment features for opinion tags and opinion clusters.

For the target $j$-th opinion tag, the ranks of FOCs are set to $j$ as well, while the ranks of OOCs are set to $0$. Then we enhance the word vectors in $\mathbf{X}$ with alignment features to capture the alignment between reviews and opinion tags:

(12) $\text{aln}_{x_{i,p}} = \begin{cases}\text{Aln}(j), & x_{i,p}\in\text{FOCs}\\ \text{Aln}(0), & x_{i,p}\in\text{OOCs}.\end{cases}$

Next, the alignment features $[\text{aln}_{x_{1,1}},\ldots,\text{aln}_{x_{K,c_{K}}}]$ are added to the review representations $[\mathbf{x}_{1,1},\ldots,\mathbf{x}_{K,c_{K}}]$ to obtain the alignment-enhanced representations $\mathbf{R}=[\mathbf{r}_{1,1},\ldots,\mathbf{r}_{K,c_{K}}]$ for the words in opinion clusters:

(13) $\mathbf{r}_{i,p} = \mathbf{W}_{rc}\,\text{aln}_{x_{i,p}} + \mathbf{x}_{i,p},$

where $\mathbf{W}_{rc}\in\mathbb{R}^{d_{e}}$ is a model parameter. Figure 5 summarizes the construction.
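The rank assignment of Eq. (12) can be illustrated with a short sketch that produces, for every word of the flattened memory bank, the integer that would be passed to $\text{Aln}(\cdot)$. The function name and its arguments are hypothetical; the embedding lookup and the projection of Eq. (13) are left out.

```python
def alignment_ranks(cluster_word_counts, j, foc_ranks):
    """Per-word rank ids for Eq. (12): j for words in FOCs, 0 for words in OOCs.

    cluster_word_counts: [c_1, ..., c_K], number of words per ranked cluster.
    j: rank of the opinion tag currently being generated.
    foc_ranks: set of 1-based cluster ranks treated as FOCs for tag j.
    """
    ranks = []
    for k, count in enumerate(cluster_word_counts, start=1):
        ranks.extend([j if k in foc_ranks else 0] * count)
    return ranks
```

Since the FOCs depend on $j$, these ids change from one opinion tag to the next, while the underlying word representations $\mathbf{x}_{i,p}$ stay fixed.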

At each decoding step $q$ of the $j$-th opinion tag, the decoder reads the embedding of the last prediction $\mathbf{y}_{j,q-1}$:

(14) $\mathbf{y}_{j,q} = \text{transformer\_decoder}(\mathbf{y}_{j,q-1}),$

where $\mathbf{y}_{j,q}\in\mathbb{R}^{d_{e}}$ is the target word representation. Next, we introduce our decoder in detail.

To capture semantic and alignment information from the opinion clusters, a multi-head cross-attention MHAtt (Vaswani et al., 2017) is applied to compute the attention scores $[\alpha_{1,1},\ldots,\alpha_{K,c_{K}}]$ between the last prediction $\mathbf{y}_{j,q-1}$ and $[\mathbf{r}_{1,1},\ldots,\mathbf{r}_{K,c_{K}}]$:

(15) $\mathbf{u}_{j,q-1}^{z} = \mathbf{W}_{a}^{z}\mathbf{y}_{j,q-1},$
(16) $\mathbf{k}_{i,p}^{z} = \mathbf{W}_{b}^{z}\mathbf{r}_{i,p},$
(17) $\alpha_{i,p}^{z} = \frac{\exp({\mathbf{u}_{j,q-1}^{z}}^{T}\mathbf{k}_{i,p}^{z})}{\sum_{i^{\prime}=1}^{K}\sum_{p^{\prime}=1}^{c_{i^{\prime}}}\exp({\mathbf{u}_{j,q-1}^{z}}^{T}\mathbf{k}_{i^{\prime},p^{\prime}}^{z})},$

where $\mathbf{u}_{j,q-1}^{z}\in\mathbb{R}^{d_{h}}$ and $\mathbf{k}_{i,p}^{z}\in\mathbb{R}^{d_{h}}$ are query and key vectors that are linearly transformed from $\mathbf{y}_{j,q-1}$ and $\mathbf{r}_{i,p}$ as in (Vaswani et al., 2017); $z\in\{1,\ldots,n_{h}\}$ indicates the $z$-th head among $n_{h}$ heads; $d_{h}=d_{e}/n_{h}$ is the dimension of each head.

The attention scores $[\alpha_{i,p}^{1},\ldots,\alpha_{i,p}^{n_{h}}]$ are then used to compute an aggregated vector $\mathbf{c}_{j,q}$ for target word $y_{j,q}$:

(18) $\mathbf{c}_{j,q} = \left[\sum_{i=1}^{K}\sum_{p=1}^{c_{i}}\alpha_{i,p}^{1}\mathbf{r}_{i,p};\ \ldots;\ \sum_{i=1}^{K}\sum_{p=1}^{c_{i}}\alpha_{i,p}^{n_{h}}\mathbf{r}_{i,p}\right].$

Then we feed the last word representation $\mathbf{y}_{j,q-1}$ and vector $\mathbf{c}_{j,q}$ to a two-layer feed-forward network with a ReLU as the activation function and a highway layer normalization on top:

(19) $\mathbf{s}_{j,q-1} = \text{LayerNorm}(\mathbf{y}^{n-1}_{j,q-1} + \mathbf{c}_{j,q}),$
(20) $\mathbf{y}^{n}_{j,q-1} = \text{LayerNorm}(\mathbf{s}_{j,q-1} + \text{FFN}(\mathbf{s}_{j,q-1})),$

where $\mathbf{y}_{j,q}=\mathbf{y}^{n}_{j,q-1}$ is the target word representation. After that, we use $\mathbf{y}_{j,q}$ to compute a probability distribution over the words in a predefined vocabulary $\mathcal{V}$, as shown in the following equation:

(21) $P_{\mathcal{V}}(y_{j,q}\mid[y_{1,0},\ldots,y_{j,q-1}],\mathcal{X}) = \text{softmax}(\mathbf{W}_{v}\mathbf{y}_{j,q}+\mathbf{b}_{v}),$

where $\mathbf{W}_{v}\in\mathbb{R}^{|\mathcal{V}|\times d_{e}}$ and $\mathbf{b}_{v}\in\mathbb{R}^{|\mathcal{V}|}$ are trainable parameters. To enable our model to generate out-of-vocabulary (OOV) words, we adopt the copy mechanism (See et al., 2017) to predict OOV words by directly copying words from the opinion clusters. We first compute a soft gate $p_{gen}\in[0,1]$ between generating a word from the predefined vocabulary $\mathcal{V}$ and copying a word from the input reviews $\mathcal{X}$:

(22) $p_{gen} = \sigma(\mathbf{W}_{g}\mathbf{y}_{j,q}+b_{g}),$

where $\mathbf{W}_{g}\in\mathbb{R}^{d_{e}}$ and $b_{g}\in\mathbb{R}$ are trainable parameters. Finally, we can derive the final next-word probability distribution $P(y_{j,q})$:

(23) $\alpha_{i,p}=\frac{\sum_{z=1}^{n_{h}}\alpha_{i,p}^{z}}{n_{h}},$
(24) $P(y_{j,q})=p_{gen}P_{\mathcal{V}}(y_{j,q})+(1-p_{gen})\sum_{i,p:x_{i,p}=y_{j,q}}\alpha_{i,p},$

where we use $P(y_{j,q})$ to denote $P(y_{j,q}\mid[y_{1,0},\ldots,y_{j,q-1}],\mathcal{X})$ for brevity. We use the negative log-likelihood of the ground-truth words $y^{*}_{j,q}$ as the generation loss function:

(25) $\mathcal{L}_{gen}=-\sum_{j=1}^{N}\sum_{q=1}^{L_{y_{j}}}\log P(y^{*}_{j,q}\mid[y^{*}_{1,0},\ldots,y^{*}_{j,q-1}],\mathcal{X}).$
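The mixture in Eqs. 21 to 25 can be illustrated with a small, framework-free sketch. It is not the authors' implementation: the function names, the toy vocabulary, the flattened `source_tokens` list standing in for the words $x_{i,p}$, and the `eps` guard inside the logarithm are all assumptions made for illustration, and the logits are passed in directly rather than computed from $\mathbf{W}_{v}$ and $\mathbf{W}_{g}$.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def final_distribution(vocab, vocab_logits, gate_logit, head_attention, source_tokens):
    """Combine generation and copy probabilities for one decoding step."""
    p_vocab = softmax(vocab_logits)   # Eq. 21: distribution over the vocabulary
    p_gen = sigmoid(gate_logit)       # Eq. 22: soft gate in [0, 1]
    n_heads = len(head_attention)
    # Eq. 23: average the attention over the n_h heads, per source token.
    attention = [sum(head[t] for head in head_attention) / n_heads
                 for t in range(len(source_tokens))]
    # Eq. 24: generation mass plus copy mass summed over matching source tokens.
    probs = {word: p_gen * p for word, p in zip(vocab, p_vocab)}
    for token, weight in zip(source_tokens, attention):
        probs[token] = probs.get(token, 0.0) + (1.0 - p_gen) * weight
    return probs


def generation_loss(step_distributions, gold_words, eps=1e-12):
    """Eq. 25 for a single tag sequence; eps is an assumption to avoid log(0)."""
    return -sum(math.log(dist.get(word, 0.0) + eps)
                for dist, word in zip(step_distributions, gold_words))


vocab = ["good", "service", "price"]
source_tokens = ["tasty", "service", "tasty"]   # "tasty" is out of vocabulary
head_attention = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
probs = final_distribution(vocab, [0.0, 0.0, 0.0], 0.0, head_attention, source_tokens)
loss = generation_loss([probs], ["tasty"])
```

In this toy step the OOV word "tasty" receives probability only through the copy term, which is the behavior the copy mechanism is meant to provide.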

Alignment Loss

During the generation of the $j$-th opinion tag, we enforce an alignment loss to help locate FOCs accurately. The model is explicitly taught to focus on FOCs and ignore OOCs in the attention $\alpha_{i,p}$ via the following loss:

(26) $\mathcal{L}_{aln}=-\log\left(\frac{\sum_{i,p:x_{i,p}\in\text{FOCs}}\alpha_{i,p}}{\sum_{i,p}\alpha_{i,p}}\right)+\log\left(\frac{\sum_{i,p:x_{i,p}\in\text{OOCs}}\alpha_{i,p}}{\sum_{i,p}\alpha_{i,p}}\right).$
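As a rough illustration of Eq. 26, the sketch below computes the alignment loss for one decoding step from a flat list of attention weights. The function name, the boolean `in_focus` mask (true where the word belongs to a FOC, false where it belongs to an OOC), and the `eps` guard are assumptions of this sketch; it also assumes every attended word falls in exactly one of the two groups.

```python
import math


def alignment_loss(attention, in_focus, eps=1e-12):
    """Eq. 26: reward attention mass on FOC words, penalize mass on OOC words."""
    total = sum(attention)
    foc_mass = sum(a for a, focused in zip(attention, in_focus) if focused)
    ooc_mass = sum(a for a, focused in zip(attention, in_focus) if not focused)
    # eps is an addition of this sketch to keep the logarithms finite.
    return -math.log(foc_mass / total + eps) + math.log(ooc_mass / total + eps)


# Three attended words: the first two lie in focused clusters, the last does not.
loss = alignment_loss([0.6, 0.2, 0.2], [True, True, False])
sharper = alignment_loss([0.7, 0.2, 0.1], [True, True, False])
```

Moving attention mass from OOC words to FOC words lowers the loss, so `sharper` is smaller than `loss`.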

4.5. Multi-task Training Objective

We adopt a multi-task learning framework to jointly minimize the salience classification loss, alignment loss, and generation loss. The objective function is:

(27) $\mathcal{L}=\lambda_{1}\mathcal{L}_{cla}+\lambda_{2}\mathcal{L}_{aln}+\lambda_{3}\mathcal{L}_{gen},$

where $\lambda_{1},\lambda_{2},\lambda_{3}$ are hyper-parameters that control the weights of these three losses. We set $\lambda_{1}=\lambda_{2}=\lambda_{3}=1$. Thus, each component of our joint model can be trained end-to-end.

5. Experimental Setup

We set up experiments to compare AOT-Net against a number of relevant baselines. We are interested in the overall performance of AOT-Net and in understanding the effectiveness of the salience estimation and ranking alignment.

5.1. Experiments

We report on five experiments. First, we compare AOT-Net against a number of baselines to assess its overall performance. Second, we conduct ablation studies to analyze the influence of different components in AOT-Net: (i) w/o SSE is AOT-Net without the sentence-level salience estimating component (SSE); (ii) w/o RCR is AOT-Net without the review clustering and ranking component (RCR); (iii) w/o AF is AOT-Net without the alignment feature (AF); and (iv) w/o AL is AOT-Net without the alignment loss (AL). Third, to further explore the effectiveness of the sentence-level self-attention mechanism in SSE, we consider AOT-RNN, a variant that uses only the BiGRU for salience score prediction, and AOT-Embed, a variant that replaces the BiGRU in SSE with an MLP. Fourth, we analyze the performance of AOT-Net for different numbers of FOCs. Lastly, we provide a case study of abstractive opinion tagging.

5.2. Baselines

We compare AOT-Net with the following methods: (i) TF-IDF is an extractive approach that selects important words as the summary based on term frequency and inverse document frequency; (ii) TextRank (Mihalcea and Tarau, 2004) is an unsupervised algorithm based on weighted graphs; (iii) RNN is a sequence-to-sequence model with attention, implemented with bi-directional GRU layers (Cho et al., 2014); (iv) PG-Net (See et al., 2017) is a classical opinion summarization model based on the encoder-decoder framework with attention and copy mechanisms; and (v) the Transformer (Vaswani et al., 2017) is a Transformer-based encoder-decoder model with a copy mechanism, a strong baseline that is widely adopted in opinion summarization.

5.3. The eComTag Dataset

Since there is no available opinion tagging dataset, we build a new one, named eComTag, from several Chinese e-commerce websites. We collect nearly 112k items, sampled from different domains, including Cosmetic (37.43%), Electronics (29.51%), Books (10.57%), Entertainment (8.23%), Food (7.62%), Sports (3.96%), Clothes (3.16%), Medical (1.62%), and Furniture (0.33%). For each domain, there is a set of reviews and a list of opinion tags. Since reviewers may comment on multiple aspects, e.g., “The dim sum tasted extremely fresh, and the price was quite reasonable!”, we split each review into sentences by punctuation. We use a sentence to denote a review in our paper. Then, we remove samples where the number of opinion tags is smaller than 4 or the number of reviews is smaller than 50. Finally, we construct eComTag with 50,068 item samples. Users may write reviews arbitrarily, which results in many meaningless expressions, such as “Love, love, LOVE this space!” and “come with my boyfriend.” To teach our model to identify these noisy sentences, we annotate each review with a binary salience label via human judgment. If a review is item-related, we label it as 1, otherwise as 0. The salience labels are the supervision signals for the sentence-level salience estimating component.

Finally, each item sample consists of a set of reviews, a set of corresponding salience labels, and a sequence of opinion tags. For text preprocessing, we tokenize texts using the Jieba toolkit (https://github.com/fxsjy/jieba) and maintain a 50k vocabulary. In eComTag, about 30% of samples have more than 1024 words in reviews. We randomly split the dataset into training/validation/test sets with an 8:1:1 ratio. The statistics are shown in Table 1. Specifically, we define a present tag (Pr) as a tag that appears verbatim in the reviews and an absent tag (Ab) as a tag that does not appear in the reviews. The proportion of absent tags is close to 75%, which underscores the need for abstractive methods in opinion tagging.
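The preprocessing rules described above (sentence splitting by punctuation, the filtering thresholds, and the present/absent tag distinction) can be sketched as follows. This is a minimal illustration under stated assumptions, not the released pipeline: the function names are hypothetical, the exact punctuation set is a guess, and "present" is approximated by a plain substring test on the raw sentence rather than on Jieba tokens.

```python
import re

# Assumed punctuation set (ASCII and full-width); the paper does not list it.
_SENTENCE_BREAK = re.compile(r"[,.!?;，。！？；]+")


def split_review(review):
    """Split one raw review into sentences, each treated as a separate review."""
    return [part.strip() for part in _SENTENCE_BREAK.split(review) if part.strip()]


def keep_sample(sentences, tags, min_tags=4, min_reviews=50):
    """Keep an item only if it has at least 4 tags and at least 50 reviews."""
    return len(tags) >= min_tags and len(sentences) >= min_reviews


def present_absent(tags, sentences):
    """Separate tags that occur verbatim in some sentence from those that do not."""
    present = [tag for tag in tags if any(tag in s for s in sentences)]
    absent = [tag for tag in tags if tag not in present]
    return present, absent


sentences = split_review(
    "The dim sum tasted extremely fresh, and the price was quite reasonable!"
)
present, absent = present_absent(["price", "good value"], sentences)
```

Here the example review from the text yields two sentences, "price" counts as a present tag, and "good value" counts as an absent tag.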

5.4. Evaluation Metrics

We employ two information retrieval metrics to evaluate opinion tag generation: the macro $F_{1}@k$ score and the normalized discounted cumulative gain (NDCG@$k$) score. Both are widely used to measure word overlap (Sun et al., 2019; Chan et al., 2020). To measure the diversity of the generated opinion tags, we adopt the Distinct-2 score (Li et al., 2016) and a macro Unique-$N$ score. We compute Unique-$N$ as $\text{Unique-}N=\sum_{i=1}^{T}N_{i}/T$, where $N_{i}$ is the number of distinct opinion tags in the $i$-th sample and $T$ is the number of samples. Moreover, we design two metrics, Exact Rank Match (ERM) and Fuzzy Rank Match (FRM), to evaluate the rank accuracy of opinion tags. ERM is defined as the proportion of one-to-one exact matches between true tags and predicted tags. Inspired by the Embedding Score (Liu et al., 2016), FRM first maps predicted tags and the corresponding true tags into the same vector space, and then computes the average cosine similarity between their vector representations.
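Three of these metrics can be sketched in a few lines. The definitions below are assumptions where the text is not explicit: precision in $F_{1}@k$ is taken over the top-$k$ predictions and recall over all true tags, and ERM is read as position-wise exact agreement between the predicted and true rankings, normalized by the number of true tags. The function names are hypothetical; NDCG@$k$, Distinct-2, and FRM are omitted.

```python
def f1_at_k(predicted, gold, k):
    """F1 between the top-k predicted tags and the set of true tags."""
    top_k = predicted[:k]
    if not top_k or not gold:
        return 0.0
    hits = len(set(top_k) & set(gold))
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(set(gold))
    return 2 * precision * recall / (precision + recall)


def macro_f1_at_k(samples, k):
    """Average F1@k over (predicted, gold) pairs, one pair per item."""
    return sum(f1_at_k(pred, gold, k) for pred, gold in samples) / len(samples)


def unique_n(predictions):
    """Unique-N: mean number of distinct tags generated per sample."""
    return sum(len(set(tags)) for tags in predictions) / len(predictions)


def exact_rank_match(predicted, gold):
    """Assumed ERM: share of rank positions where predicted and true tags agree."""
    if not gold:
        return 0.0
    hits = sum(1 for p, g in zip(predicted, gold) if p == g)
    return hits / len(gold)


predicted = ["good service", "fair price", "clean", "fast delivery"]
gold = ["good service", "clean", "fair price"]
f1 = f1_at_k(predicted, gold, 3)
erm = exact_rank_match(predicted, gold)
uniq = unique_n([predicted, ["a", "a", "b"]])
```

In this example the top three predictions contain all true tags, so $F_{1}@3$ is perfect, while only the first rank position agrees exactly, which shows why a rank-sensitive metric is reported alongside the overlap metrics.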

Table 1. Statistics of the eComTag dataset. “Pr” and “Ab” denote the proportion of present tags and absent tags respectively. “MTN” and “MTL” indicate max tag number and max tag length for a sample respectively.

| Data | Sample | Pr | Ab | MTN | MTL |
|---|---|---|---|---|---|
| Training | 40,162 | 24.1% | 75.9% | 19 | 40 |
| Validation | 4,953 | 24.3% | 75.7% | 20 | 39 |
| Test | 4,953 | 24.1% | 75.9% | 14 | 32 |

5.5. Implementation Details

We adopt the Adam (Kingma and Ba, 2015) optimizer with settings {$\beta_{1}=0.9$, $\beta_{2}=0.999$, $\epsilon=10^{-8}$, $lr=10^{-4}$} and we vary the learning rate following Vaswani et al. (2017). We add dropout (Srivastava et al., 2014) with a keep rate of 0.8 and label smoothing (Szegedy et al., 2016) with a smoothing factor of 0.1. We use the Tencent AI Lab Chinese Embeddings (https://ai.tencent.com/ailab/nlp/embedding.html) to initialize the word embedding layers. The rest of the parameters are randomly initialized. The dimensions of the alignment feature, word embedding layers, and positional embedding layers are set to 200. We set the batch size to 16 and use the validation loss for early stopping. At inference time, we set the maximum number of decoding steps to 50. We use a bidirectional GRU (Cho et al., 2014) with 2 layers to implement the sentence-level salience estimating component. All RNN-based models have 256 hidden units. All Transformer-based models have 300 hidden units; the feed-forward hidden size is set to 50 for all layers. We set $F$ in Section 4.4 to 3 ($F$ is the number of Focused Opinion Clusters at each decoding step) so that each target tag token has enough relevant reviews to reference without introducing too much interfering information. AOT-Net is implemented in PyTorch and was trained on a single Tesla V100 GPU. All hyperparameters and models are selected on the validation set, and the results are reported on the test set.
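The learning-rate variation "following Vaswani et al. (2017)" refers to a warm-up schedule that grows linearly and then decays with the inverse square root of the step. The sketch below shows that formula only. The warm-up length is not stated in this paper, so `warmup_steps=4000` (the value used by Vaswani et al.) and `d_model=300` (taken from the Transformer hidden size above) are assumptions, and how the schedule is combined with the $10^{-4}$ Adam setting is not specified here and is therefore left out.

```python
def noam_lr(step, d_model=300, warmup_steps=4000):
    """Learning-rate schedule from Vaswani et al. (2017).

    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    Both default values are assumptions of this sketch.
    """
    step = max(step, 1)  # the formula is undefined at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The rate rises during warm-up, peaks at warmup_steps, then decays.
schedule = [noam_lr(s) for s in (1, 2000, 4000, 8000)]
```

With PyTorch, a function of this shape could be passed to `torch.optim.lr_scheduler.LambdaLR` as a multiplicative factor, though whether the authors did so is not stated.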

Table 2. Evaluation results on the eComTag dataset. Results in bold are leading results in terms of the corresponding metric.

| Models | $F_{1}@5$ | $F_{1}@10$ | NDCG@5 | NDCG@10 | ERM | FRM | Distinct-2 (Micro) | Distinct-2 (Macro) | Unique-N |
|---|---|---|---|---|---|---|---|---|---|
| TF-IDF | 0.0039 | 0.0038 | 0.0168 | 0.0169 | 0.16 | 0.19 | | | |
| TextRank | 0.0019 | 0.0018 | 0.0091 | 0.0097 | 0.06 | 0.21 | | | |
| RNN | 0.2895 | 0.2753 | 0.7383 | 0.7701 | 0.17 | 0.44 | 0.60 | 62.19 | 7.447 |
| PG-Net | 0.3138 | 0.2896 | 0.7600 | 0.8009 | 0.19 | 0.44 | 1.33 | 65.78 | 6.798 |
| Transformer | 0.2833 | 0.2756 | 0.6916 | 0.7483 | 0.25 | 0.59 | 1.57 | 89.23 | 8.851 |
| AOT-Net | 0.3529 | 0.3492 | 0.7473 | 0.8045 | 0.31 | 0.64 | 1.30 | 94.35 | 8.953 |
| w/o SSE | 0.2930 | 0.2822 | 0.7022 | 0.7563 | 0.25 | 0.61 | 1.11 | 91.85 | 8.957 |
| w/o RCR | 0.3434 | 0.3370 | 0.7353 | 0.7913 | 0.31 | 0.63 | 1.77 | 92.94 | 8.935 |
| w/o AF | 0.3141 | 0.3056 | 0.7194 | 0.7768 | 0.28 | 0.62 | 1.21 | 93.23 | 8.997 |
| w/o AL | 0.3406 | 0.3336 | 0.7322 | 0.7857 | 0.30 | 0.64 | 1.30 | 93.11 | 8.926 |

6. Results and Analysis

Overall evaluation results on generating opinion tags are listed in Table 2. We find that all the abstractive models significantly outperform all the traditional extractive baselines. We conclude that the informal and colloquial nature of user-generated reviews makes item-related features hard to discern with unsupervised extraction methods. As expected, we also find that PG-Net significantly outperforms RNN, which implies that the copy mechanism is useful for opinion summarization. AOT-Net significantly outperforms the Transformer baseline on all metrics except Micro Distinct-2 and achieves the best performance for most metrics. The results of our ablation studies are shown in the lower part of Table 2. We observe that after removing the SSE component, the performance of AOT-Net drops noticeably on most metrics. If we do not group reviews into opinion clusters and rank them before decoding (i.e., w/o RCR), the Micro Distinct-2 diversity metric increases slightly, but the rank accuracy of AOT-Net decreases, as we anticipated. We also find that after removing the alignment feature or the alignment loss in the decoder, both the retrieval and the rank accuracy metrics (i.e., $F_{1}$ and ERM) degrade. We analyze the individual components in detail in the following sections.

6.1. Salience Estimation Analysis

As shown in Table 2, AOT-Net achieves a 17.1% and 2.72% increase over “w/o SSE” in terms of Micro-Distinct-2 and Macro-Distinct-2, respectively. Similar improvements can be observed for other metrics. This demonstrates that the predicted salience scores help AOT-Net to focus on more valuable reviews. To examine the effectiveness of SSE in more detail, Figure 6 lists the accuracy scores of AOT-Embed, AOT-RNN, and AOT-Net for sentence-level salience estimation. We find that AOT-Net outperforms both AOT-Embed and AOT-RNN, which verifies the effectiveness of the BiGRU and self-attention mechanisms. AOT-RNN achieves a 7.2% increase over AOT-Embed in terms of accuracy for all reviews, which verifies the advantage of the BiGRU in representing reviews. In terms of accuracy, we find that AOT-Net gives a 0.4% and 5.1% increase over AOT-RNN for all reviews and item-related reviews, respectively. This indicates that AOT-Net benefits from self-attention mechanisms, which capture the shared information needed to distinguish item-related reviews.
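The 17.1% and 2.72% figures are relative increases computed from the Distinct-2 columns of Table 2. A short check, with a hypothetical helper name:

```python
def relative_increase(new, old):
    """Relative increase of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# AOT-Net versus "w/o SSE", values taken from Table 2.
micro = relative_increase(1.30, 1.11)     # Micro Distinct-2
macro = relative_increase(94.35, 91.85)   # Macro Distinct-2
print(f"{micro:.1f}% {macro:.2f}%")       # prints "17.1% 2.72%"
```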

Figure 6. Accuracy values for sentence-level salience estimation.

6.2. Number of FOCs

To evaluate the effect of the number of FOCs on the performance of rank-aware opinion tagging, we examine the performance of AOT-Net with different values of $F$ (see Section 4.4) in terms of ERM, FRM, and Distinct-2. As shown in Table 3, $F=1$ noticeably decreases the rank and diversity performance of the model. This suggests that focusing on only a single opinion cluster ignores many related reviews. We also find that when $F=5$, the performance of AOT-Net decreases slightly; AOT-Net achieves the best performance in terms of all metrics when $F=3$. Hence, we infer that $F$ controls a trade-off between focusing on relevant reviews and removing irrelevant noise.

Table 3. Performance on different numbers ($F$) of FOCs.

| Models | ERM | FRM | Macro Distinct-2 |
|---|---|---|---|
| $F$=1 | 0.30 | 0.62 | 93.24 |
| $F$=3 | 0.31 | 0.64 | 94.35 |
| $F$=5 | 0.30 | 0.63 | 93.32 |

Figure 7. The transition of the attention distribution over reviews between tags, as computed by AOT-Net. Different colors correspond to different tags. We only depict the top-5 attention probabilities for brevity. (Best viewed in color.)

6.3. Case Study

Figure 7 shows an example illustrating the 5 highest attention weights $\alpha_{i,p}$ during the generation of opinion tags. We see that the model shifts its focus smoothly from the first review sentence to later review sentences. Sometimes, the model may focus on the same review for two opinion tags, such as “The service (1st tag) is good and hairstyle (2nd tag) is good.” The first three tags predicted by AOT-Net were completely accurate, whereas the Transformer only predicted the first tag accurately. To validate the effectiveness of the alignment loss, we calculate $\sum_{i,p:x_{i,p}\in\text{FOCs}}\alpha_{i,p}$ and $\sum_{i,p:x_{i,p}\in\text{OOCs}}\alpha_{i,p}$ for all items in the test set. For AOT-Net, these sums are 0.7304 and 0.2696 on average, respectively. For AOT-Net w/o AL, they are 0.6509 and 0.3491, respectively. This indicates that the alignment feature by itself is not enough to force AOT-Net to focus on FOCs. We also compare the ranks of the opinion tags generated by the Transformer and by AOT-Net. For the Transformer, items where the first 3 opinion tags have accurate ranks account for only about 6.11%, while for AOT-Net, the proportion reaches 9.68%. Thus, we conclude that the alignment feature and alignment loss in AOT-Net help to capture the ranks of the opinion tags.

7. Conclusion and Future Work

In this paper, we have proposed the abstractive opinion tagging task, which aims to automatically generate a ranked list of opinion tags from a large number of reviews. We have proposed a rank-aware abstractive opinion tagging framework (AOT-Net) that includes a sentence-level salience estimating component, a review clustering and ranking component, and a rank-aware opinion tagging component. To validate the effectiveness of AOT-Net, we conduct extensive experiments on a newly collected real-world dataset, eComTag. Experiments show that AOT-Net achieves state-of-the-art performance on the abstractive opinion tagging task. AOT-Net has two main advantages over previous work. On the one hand, it generates more concise opinion tags; on the other hand, the ranked lists of generated opinion tags help users distinguish products with very similar aspects. Our work provides a plausible solution to greatly reduce human annotation costs for online e-commerce opinion tagging. Although we focused mostly on e-commerce portals, our methods are also broadly applicable to other settings with opinionated content, such as microblogs.

Limitations of our work include its low efficiency and coarse-grained salience estimation. In future work, we will adopt a deep clustering network instead of the $k$-means algorithm. Also, pre-trained language models could strengthen our sentence salience estimation. Few-shot learning for handling unbalanced review distributions among different domains could be another direction. It would also be interesting to explore user interactions with AOT-Net to generate personalized opinion tags.

Code and Data

The source code and dataset used in this paper are available at https://github.com/qtli/AOT.

Acknowledgements.
We thank our reviewers for valuable feedback. This work was supported by the National Key R&D Program of China with grant No. 2020YFB1406704, the Natural Science Foundation of China (61972234, 61902219, 61672324, 61672322, 62072279), the Key Scientific and Technological Innovation Program of Shandong Province (2019JZZY010129), the Tencent AI Lab Rhino-Bird Focused Research Program (JR201932), the Fundamental Research Funds of Shandong University, the Foundation of State Key Laboratory of Cognitive Intelligence, iFLYTEK, P.R. China (COGOSC-20190003). All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.

References

  • Amplayo and Lapata (2019) Reinald Kim Amplayo and Mirella Lapata. 2019. Informative and Controllable Opinion Summarization. CoRR abs/1909.02322 (2019).
  • Amplayo and Lapata (2020) Reinald Kim Amplayo and Mirella Lapata. 2020. Unsupervised Opinion Summarization with Noising and Denoising. In ACL. 1934–1945.
  • Angelidis and Lapata (2018) Stefanos Angelidis and Mirella Lapata. 2018. Summarizing Opinions: Aspect Extraction Meets Sentiment Prediction and They Are Both Weakly Supervised. In EMNLP. 3675–3686.
  • Ba et al. (2016) Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. CoRR abs/1607.06450 (2016).
  • Brazinskas et al. (2020) Arthur Brazinskas, Mirella Lapata, and Ivan Titov. 2020. Unsupervised Opinion Summarization as Copycat-Review Generation. In ACL. 5151–5169.
  • Carenini et al. (2013) Giuseppe Carenini, Jackie Chi Kit Cheung, and Adam Pauls. 2013. Multi-Document Summarization of Evaluative Text. Comput. Intell. 29, 4 (2013), 545–576.
  • Carenini et al. (2006) Giuseppe Carenini, Raymond T. Ng, and Adam Pauls. 2006. Multi-Document Summarization of Evaluative Text. In EACL. The Association for Computer Linguistics.
  • Carmeli et al. (2020) Nofar Carmeli, Xiaolan Wang, Yoshihiko Suhara, Stefanos Angelidis, Yuliang Li, Jinfeng Li, and Wang-Chiew Tan. 2020. ExplainIt: Explainable Review Summarization with Opinion Causality Graphs. CoRR abs/2006.00119 (2020).
  • Chan et al. (2020) Hou Pong Chan, Wang Chen, and Irwin King. 2020. A Unified Dual-view Model for Review Summarization and Sentiment Classification with Inconsistency Loss. In SIGIR. 1191–1200.
  • Chan et al. (2019) Hou Pong Chan, Wang Chen, Lu Wang, and Irwin King. 2019. Neural Keyphrase Generation via Reinforcement Learning with Adaptive Rewards. In ACL. 2163–2174.
  • Chen et al. (2019) Wang Chen, Yifan Gao, Jiani Zhang, Irwin King, and Michael R Lyu. 2019. Title-Guided Encoding for Keyphrase Generation. In AAAI, Vol. 33. 6268–6275.
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP. 1724–1734.
  • Elad et al. (2019) Guy Elad, Ido Guy, Slava Novgorodov, Benny Kimelfeld, and Kira Radinsky. 2019. Learning to Generate Personalized Product Descriptions. In CIKM. 389–398.
  • Fabbrizio et al. (2014) Giuseppe Di Fabbrizio, Amanda Stent, and Robert J. Gaizauskas. 2014. A Hybrid Approach to Multi-document Summarization of Opinions in Reviews. In INLG. 54–63.
  • Ganesan et al. (2010) Kavita Ganesan, ChengXiang Zhai, and Jiawei Han. 2010. Opinosis: A Graph Based Approach to Abstractive Summarization of Highly Redundant Opinions. In COLING. 340–348.
  • Gao et al. (2019) Shen Gao, Zhaochun Ren, Yihong Eric Zhao, Dongyan Zhao, Dawei Yin, and Rui Yan. 2019. Product-Aware Answer Generation in E-Commerce Question-Answering. In WSDM. 429–437.
  • Gerani et al. (2014) Shima Gerani, Yashar Mehdad, Giuseppe Carenini, Raymond T. Ng, and Bita Nejat. 2014. Abstractive Summarization of Product Reviews Using Discourse Structure. In EMNLP. 1602–1613.
  • Gollapalli et al. (2017) Sujatha Das Gollapalli, Xiao-Li Li, and Peng Yang. 2017. Incorporating Expert Knowledge into Keyphrase Extraction. In AAAI.
  • Hu and Liu (2004) Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In SIGKDD. ACM, 168–177.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
  • Le et al. (2016) Tho Thi Ngoc Le, Minh Le Nguyen, and Akira Shimazu. 2016. Unsupervised Keyphrase Extraction: Introducing New Kinds of Words to Keyphrases. In AI 2016 (LNCS), Vol. 9992. 665–671.
  • Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A Diversity-Promoting Objective Function for Neural Conversation Models. In NAACL. 110–119.
  • Li et al. (2019b) Junjie Li, Xuepeng Wang, Dawei Yin, and Chengqing Zong. 2019b. Attribute-aware Sequence Network for Review Summarization. In EMNLP-IJCNLP. 2998–3008.
  • Li et al. (2020) Pengyuan Li, Lei Huang, and Guang-jie Ren. 2020. Topic Detection and Summarization of User Reviews. CoRR abs/2006.00148 (2020).
  • Li et al. (2019a) Piji Li, Zihao Wang, Lidong Bing, and Wai Lam. 2019a. Persona-Aware Tips Generation. In The World Wide Web Conference. 1006–1016.
  • Li et al. (2017) Piji Li, Zihao Wang, Zhaochun Ren, Lidong Bing, and Wai Lam. 2017. Neural Rating Regression with Abstractive Tips Generation for Recommendation. In SIGIR. 345–354.
  • Liu et al. (2016) Chia-Wei Liu, Ryan Lowe, Iulian Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. In EMNLP. 2122–2132.
  • Liu et al. (2020a) Dayiheng Liu, Yeyun Gong, Jie Fu, Wei Liu, Yu Yan, Bo Shao, Daxin Jiang, Jiancheng Lv, and Nan Duan. 2020a. Diverse, Controllable, and Keyphrase-Aware: A Corpus and Method for News Multi-Headline Generation. arXiv preprint arXiv:2004.03875 (2020).
  • Liu et al. (2020b) Rui Liu, Zheng Lin, and Weiping Wang. 2020b. Keyphrase Prediction With Pre-trained Language Model. arXiv preprint arXiv:2004.10462 (2020).
  • Lu et al. (2009) Yue Lu, ChengXiang Zhai, and Neel Sundaresan. 2009. Rated Aspect Summarization of Short Comments. In WWW. 131–140.
  • Luan et al. (2017) Yi Luan, Mari Ostendorf, and Hannaneh Hajishirzi. 2017. Scientific Information Extraction with Semi-supervised Neural Tagging. In EMNLP.
  • MacQueen (1967) James MacQueen. 1967. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1. Oakland, CA, USA, 281–297.
  • Medelyan et al. (2009) Olena Medelyan, Eibe Frank, and Ian H Witten. 2009. Human-competitive Tagging using Automatic Keyphrase Extraction. In EMNLP. 1318–1327.
  • Meng et al. (2017) Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky, and Yu Chi. 2017. Deep Keyphrase Generation. CoRR abs/1704.06879 (2017).
  • Mihalcea and Tarau (2004) Rada Mihalcea and Paul Tarau. 2004. Textrank: Bringing order into text. In EMNLP. 404–411.
  • Novgorodov et al. (2019) Slava Novgorodov, Ido Guy, Guy Elad, and Kira Radinsky. 2019. Generating Product Descriptions from User Reviews. In WWW. 1354–1364.
  • Ray Chowdhury et al. (2019) Jishnu Ray Chowdhury, Cornelia Caragea, and Doina Caragea. 2019. Keyphrase extraction from disaster-related tweets. In WWW. 1555–1566.
  • See et al. (2017) Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In ACL. 1073–1083.
  • Shen et al. (2018) Dinghan Shen, Guoyin Wang, Wenlin Wang, Martin Renqiang Min, Qinliang Su, Yizhe Zhang, Chunyuan Li, Ricardo Henao, and Lawrence Carin. 2018. Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms. In ACL. 440–450.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 15, 1 (2014), 1929–1958.
  • Suhara et al. (2020) Yoshihiko Suhara, Xiaolan Wang, Stefanos Angelidis, and Wang-Chiew Tan. 2020. OpinionDigest: A Simple Framework for Opinion Summarization. In ACL. 5789–5798.
  • Sun et al. (2019) Zhiqing Sun, Jian Tang, Pan Du, Zhi-Hong Deng, and Jian-Yun Nie. 2019. DivGraphPointer: A Graph Pointer Network for Extracting Diverse Keyphrases. In SIGIR. 755–764.
  • Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In CVPR. 2818–2826.
  • Tang et al. (2015) Duyu Tang, Bing Qin, Ting Liu, and Yuekui Yang. 2015. User Modeling with Neural Network for Review Rating Prediction. In IJCAI.
  • Tay (2019) Wenyi Tay. 2019. Not All Reviews Are Equal: Towards Addressing Reviewer Biases for Opinion Summarization. In ACL. 34–42.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS. 6000–6010.
  • Wang and Ling (2016) Lu Wang and Wang Ling. 2016. Neural Network-Based Abstract Generation for Opinions and Arguments. In NAACL. 47–57.
  • Wang et al. (2016) Minmei Wang, Bo Zhao, and Yihua Huang. 2016. PTR: Phrase-based Topical Ranking for Automatic Keyphrase Extraction in Scientific Publications. In ICONIP. 120–128.
  • Wang et al. (2019a) Yue Wang, Jing Li, Hou Pong Chan, Irwin King, Michael R. Lyu, and Shuming Shi. 2019a. Topic-Aware Neural Keyphrase Generation for Social Media Language. In ACL. 2516–2526.
  • Wang et al. (2019b) Yue Wang, Jing Li, Irwin King, Michael R. Lyu, and Shuming Shi. 2019b. Microblog Hashtag Generation via Encoding Conversation Contexts. In NAACL-HLT. 1624–1633.
  • Xiong and Litman (2014) Wenting Xiong and Diane J. Litman. 2014. Empirical Analysis of Exploiting Review Helpfulness for Extractive Summarization of Online Reviews. In COLING. 1985–1995.
  • Zhang et al. (2016) Qi Zhang, Yang Wang, Yeyun Gong, and Xuanjing Huang. 2016. Keyphrase Extraction Using Deep Recurrent Neural Networks on Twitter. In EMNLP. 836–845.
  • Zhang et al. (2018) Yingyi Zhang, Jing Li, Yan Song, and Chengzhi Zhang. 2018. Encoding Conversation Context for Neural Keyphrase Extraction from Microblog Posts. In NAACL-HLT. 1676–1686.