Abstractive Opinion Tagging
Abstract.
In e-commerce, opinion tags refer to a ranked list of tags provided by the e-commerce platform that reflect characteristics of reviews of an item. To help consumers quickly grasp a large number of reviews about an item, e-commerce platforms increasingly apply opinion tags. Current mechanisms for generating opinion tags rely on either manual labelling or heuristic methods, which are time-consuming and ineffective. In this paper, we propose the abstractive opinion tagging task, where systems have to automatically generate a ranked list of opinion tags that are based on, but need not occur in, a given set of user-generated reviews.
The abstractive opinion tagging task comes with three main challenges: (1) the noisy nature of reviews; (2) the formal nature of opinion tags vs. the colloquial language usage in reviews; and (3) the need to distinguish between different items with very similar aspects. To address these challenges, we propose an abstractive opinion tagging framework, named AOT-Net, to generate a ranked list of opinion tags given a large number of reviews. First, a sentence-level salience estimation component estimates each review’s salience score. Next, a review clustering and ranking component ranks reviews in two steps: first, reviews are grouped into clusters and ranked by cluster size; then, reviews within each cluster are ranked by their distance to the cluster center. Finally, given the ranked reviews, a rank-aware opinion tagging component incorporates an alignment feature and alignment loss to generate a ranked list of opinion tags. To facilitate the study of this task, we create and release a large-scale dataset, called eComTag, crawled from real-world e-commerce websites. Extensive experiments conducted on the eComTag dataset verify the effectiveness of the proposed AOT-Net in terms of various evaluation metrics.
1. Introduction
Reviews of a hot-pot restaurant
• The waitress was extremely attentive and even gave us a free fried man tou dessert that came with condensed milk for dipping…I love it!!
• I was pleasantly surprised about how yummy the dish and the lamb were…
• All in all; was a great experience and the service is really above and beyond.
• The restaurant guest is more, can be served quickly, our table was quickly dish bowl filled with. Overall cool experience.
• The shrimp was fresh and the pork mixture was tasty.
• Fairly quick and polite service. It worth that price!
• … excellent, relaxed and cozy atmosphere, and what can I say, satisfying.
• Food is delicious, reasonably priced…Go here! you deserve it!
Opinion tags
hospitable service (223), delicious food (165), value for money (104), comfortable environment (65), served quickly (14).
With the explosive growth of customer reviews in e-commerce scenarios, many online platforms, such as Amazon (https://www.amazon.com/) and Alibaba (https://www.alibaba.com/), provide opinion tags to enable potential buyers to make informed decisions without having to absorb large numbers of reviews. As shown in Figure 1, a sequence of opinion tags is a ranked list mined from a set of reviews that reflects different users’ preferences towards certain aspects of items. Many studies have focused on mining valuable information from reviews and shown promising results in various tasks, such as opinion summarization (Amplayo and Lapata, 2019; Li et al., 2020; Brazinskas et al., 2020; Amplayo and Lapata, 2020; Suhara et al., 2020; Li et al., 2017, 2019a; Carmeli et al., 2020) and item description generation (Novgorodov et al., 2019; Elad et al., 2019). So far, however, no study seems to have developed opinion tagging methods to generate opinion tags that reflect diverse opinions of item aspects in a concise manner. In this paper, we propose the task of abstractive opinion tagging, which aims to automatically generate a ranked list of opinion tags that stem from, but need not occur in, a given set of user-generated reviews. This is an abstractive rather than an extractive opinion tagging task, as the opinion tags do not need to occur in the reviews.
To solve this new task, we face three challenges. First, in reviews of e-commerce items, the noisy nature of the reviews inevitably makes it hard to identify salient item-related features (Gao et al., 2019; Li et al., 2020). According to human annotations (on the eComTag dataset described below), we find that almost 59% of review sentences are not item-related or do not contain opinions towards certain aspects. Second, different reviewers have different ways of expressing themselves (Tang et al., 2015), whereas the target opinion tags are usually in a more formal style. The colloquial language usage of user reviews (Wang et al., 2019a) makes abstractive methods necessary to learn better review representations. Third, it is difficult to reflect the different perspectives on an item in an accurate and diverse manner. Many e-commerce platforms rank opinion tags to help customers distinguish between different items with very similar aspects. In Figure 2, we plot the number of relevant reviews on different ranked indexes of opinion tags in the eComTag dataset (described below), which suggests a natural ranking for the tags when presented to users.

Figure 2. Reviews per opinion tag.
To address the challenges listed above, we design an abstractive framework, named AOT-Net, which consists of three components: (1) a sentence-level salience estimation component that predicts a salience score for each review; (2) a review clustering and ranking component that first groups reviews into clusters, which we refer to as “opinion clusters,” and ranks reviews by opinion cluster size; this component then ranks reviews within each cluster by their distance to the cluster center; and (3) a rank-aware opinion tagging component that generates opinion tags with ranks.
AOT-Net works in such a way that the ranks of the opinion clusters are correlated with the ranks of opinion tags. To see why this is potentially useful, we consider the eComTag dataset (crawled from e-commerce sites and consisting of items, reviews, and opinion tags), group reviews into opinion clusters, and rank clusters by cluster size. We then compute the semantic similarity between an opinion tag and an opinion cluster by averaging the semantic similarity between the tag and the reviews in the cluster. Figure 3 visualizes the average semantic similarity between opinion tags and opinion clusters for all item samples in the eComTag dataset. Clearly, the opinion tags are most similar to the opinion clusters at the same rank, revealing an alignment between the ranked opinion tags and the ranked opinion clusters. Furthermore, the opinion tags are semantically similar to the neighbors of the corresponding opinion clusters, which is reflected by the relatively wide color bands; e.g., the second opinion tag is semantically similar to the first and the third opinion clusters. The rank-aware opinion tagging component of AOT-Net integrates two alignment strategies: (1) an opinion tag and its corresponding opinion clusters are placed at similar ranks; and (2) an alignment loss explicitly encourages the model to focus on the aligned opinion clusters and ignore others.
To validate AOT-Net, we collect a new dataset, named eComTag, from Chinese e-commerce websites, containing reviews and opinion tags for 50,068 items. Our experiments based on the eComTag dataset show that AOT-Net is capable of significantly improving the generation performance on abstractive opinion tagging over state-of-the-art baselines.

Figure 3. Semantic similarity.
Our contributions can be summarized as follows:
• We propose a new task, abstractive opinion tagging, to generate opinion tags based on large volumes of item reviews.
• We propose an abstractive framework AOT-Net to generate opinion tags based on reviews for a given item. AOT-Net has a sentence-level salience estimation component and a review clustering and ranking component to highlight salient reviews and rank reviews by clustering. We propose a rank-aware opinion tagging component with two alignment strategies to generate ranked opinion tags.
• We collect a large-scale dataset, namely eComTag, consisting of item reviews and opinion tags, to support research into abstractive opinion tagging. Experimental results conducted on the eComTag dataset demonstrate the effectiveness of the AOT-Net framework.
2. Related Work
Related work comes in two categories: keyphrase generation and opinion summarization.
2.1. Keyphrase Generation
A lot of research has been conducted on generating keyphrases to summarize various types of text such as tweets, news reports, and research articles (Meng et al., 2017; Chen et al., 2019; Liu et al., 2020b; Zhang et al., 2018; Wang et al., 2019b, a; Ray Chowdhury et al., 2019; Liu et al., 2020a). Early approaches to keyphrase generation extract important phrases from the document as the results. Sequence tagging models have been applied to identify keyphrases (Zhang et al., 2016; Luan et al., 2017; Gollapalli et al., 2017). Retrieval-based approaches utilize a two-step pipeline to extract and rank candidate keyphrases (Mihalcea and Tarau, 2004; Medelyan et al., 2009; Wang et al., 2016; Le et al., 2016). Sun et al. (2019) adopt an extractive graph-based approach, which applies a pointer network to generate a set of diverse keyphrases. Recently, abstractive approaches have also been explored. Meng et al. (2017) are the first to employ an attention-based seq2seq framework with a copy mechanism to conduct abstractive keyphrase generation. Chan et al. (2019) propose a reinforcement learning approach for neural keyphrase generation that encourages a model to generate both sufficient and accurate keyphrases. Wang et al. (2019a) propose a topic-aware neural keyphrase generation method to identify topic words.
Unlike the work listed above, which only considers keyphrase generation for a single document, we consider opinion tagging from multiple documents, that is, from all of the reviews for a given item.
2.2. Opinion Summarization
Opinion summarization has become an emerging research topic in recent years. Early studies on opinion summarization focus on extracting salient sentences from the original review text (Hu and Liu, 2004; Carenini et al., 2006; Lu et al., 2009; Ganesan et al., 2010; Xiong and Litman, 2014; Angelidis and Lapata, 2018). Hu and Liu (2004) identify item features mentioned in the reviews and then extract opinion sentences for the identified features. Xiong and Litman (2014) utilize unsupervised learning methods to extract review summaries by exploiting review helpfulness ratings. Angelidis and Lapata (2018) present a weakly supervised neural framework for aspect-based opinion summarization by combining the tasks of aspect extraction and sentiment prediction. Many recent studies have shown that abstractive approaches, which can reflect the most representative opinions of reviewers, are more appropriate for summarizing review text (Carenini et al., 2013; Gerani et al., 2014; Fabbrizio et al., 2014; Li et al., 2019b; Tay, 2019). Gerani et al. (2014) utilize a template filling strategy to indirectly generate a review summary; Wang and Ling (2016) apply an attention-based encoder-decoder framework to generate an abstractive summary for opinionated documents. The main objective of the above summarization approaches is to generate coherent sentences to summarize opinions.
In contrast, we propose the abstractive opinion tagging task so as to generate opinion tags from a large number of user-generated reviews. In our scenario, opinion tags are more concise but without loss of essential information; they should help users comprehend reviews quickly and conveniently (Li et al., 2017).
3. Problem Formulation
Before detailing our proposed method, AOT-Net, we first formulate the abstractive opinion tagging problem. We use bold lowercase characters to denote vectors, and bold uppercase characters to denote matrices. We write $\mathbf{W}$ and $\mathbf{b}$ for a projection matrix and a bias vector in a neural network layer, respectively. Suppose that there are $N$ reviews $\{x_1, \ldots, x_N\}$ for a given item. We denote each review $x_i$ as a sequence of words, i.e., $x_i = (w_{i,1}, \ldots, w_{i,|x_i|})$, where $|x_i|$ denotes the number of words in $x_i$. In the same way, we assume that $M$ opinion tags $(y_1, \ldots, y_M)$ exist for a given item. We denote each opinion tag $y_j$ as a sequence of words, i.e., $y_j = (y_{j,1}, \ldots, y_{j,|y_j|})$, where $|y_j|$ refers to the number of words in $y_j$. Given a set of reviews $\{x_1, \ldots, x_N\}$, the task of abstractive opinion tagging is to generate a sequence of opinion tags $(y_1, \ldots, y_M)$.
4. Method

Figure 4. Model overview.
4.1. Overview
Before providing the details of AOT-Net, our proposed method for abstractive opinion tagging, we first provide an overview in Figure 4. We divide AOT-Net into three main phases: (A) sentence-level salience estimation; (B) review clustering and ranking; and (C) rank-aware opinion tagging. For a set of reviews about a given item, in phase A, we derive a salience score for each review to estimate its item-aware salience information. In phase B, reviews are first encoded into vector representations and weighted by corresponding salience scores. The weighted vector representations of reviews are clustered into opinion clusters and ranked by cluster size. Reviews within each cluster are ranked by their distance to the cluster center. Then we flatten ranked reviews into word-level vector representations. In phase C, we use review representations to generate ranked opinion tags via two alignment constraints, i.e., alignment features and alignment loss. We jointly learn all components in a multi-task learning framework.
4.2. A: Sentence-level Salience Estimation
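The aim of the sentence-level salience estimation component is to compute a salience score $s_i$ for each review $x_i$. We design a sentence-level self-attention mechanism to highlight item-related reviews and reduce noise. First, the component reads each review sequence $x_i$ and uses a lookup table to convert each review word $w_{i,t}$ to a word embedding vector $\mathbf{e}_{i,t}$. To incorporate the contextual information of the review text into the representation of each word, we feed each embedding vector to a bi-directional Gated Recurrent Unit (GRU) (Cho et al., 2014) to learn a hidden representation $\mathbf{h}_{i,t}$. More specifically, a bi-directional GRU consists of a forward GRU that reads the embedding sequence from $\mathbf{e}_{i,1}$ to $\mathbf{e}_{i,|x_i|}$ and a backward GRU that reads from $\mathbf{e}_{i,|x_i|}$ to $\mathbf{e}_{i,1}$:
$$\overrightarrow{\mathbf{h}}_{i,t} = \overrightarrow{\mathrm{GRU}}\big(\mathbf{e}_{i,t}, \overrightarrow{\mathbf{h}}_{i,t-1}\big), \tag{1}$$
$$\overleftarrow{\mathbf{h}}_{i,t} = \overleftarrow{\mathrm{GRU}}\big(\mathbf{e}_{i,t}, \overleftarrow{\mathbf{h}}_{i,t+1}\big), \tag{2}$$
where $\overrightarrow{\mathbf{h}}_{i,t}$ and $\overleftarrow{\mathbf{h}}_{i,t}$ denote the hidden states of the forward and backward GRU, respectively. We concatenate the last forward hidden state and the last backward hidden state to form the hidden representation $\mathbf{h}_i$ for review $x_i$, i.e., $\mathbf{h}_i = [\overrightarrow{\mathbf{h}}_{i,|x_i|}; \overleftarrow{\mathbf{h}}_{i,1}]$.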
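Next, we pass the hidden representations of all reviews to a self-attention layer to model more complex interactions among the reviews. We propose a salience context vector $\mathbf{c}_i$ for each review $x_i$ to denote the shared information from other reviews:
$$\mathbf{c}_i = \sum_{j=1}^{N} \alpha_{i,j}\, \mathbf{v}_j, \tag{3}$$
$$\alpha_{i,j} = \frac{\exp\big(\mathbf{q}_i^{\top} \mathbf{k}_j\big)}{\sum_{j'=1}^{N} \exp\big(\mathbf{q}_i^{\top} \mathbf{k}_{j'}\big)}, \tag{4}$$
where $\mathbf{q}_i$, $\mathbf{k}_j$, and $\mathbf{v}_j$ refer to query, key, and value vectors, respectively. These vectors are linearly transformed from the review hidden representations $\mathbf{h}_i$ and $\mathbf{h}_j$. Then we apply a residual connection from the review hidden representation to the salience context vector and feed the result to a two-layer feed-forward network with ReLU as the activation function:
$$\tilde{\mathbf{h}}_i = \mathrm{FFN}\big(\mathbf{h}_i + \mathbf{c}_i\big). \tag{5}$$
Given the context-enhanced review hidden representation $\tilde{\mathbf{h}}_i$, we can derive the real-valued salience score $s_i$ for $x_i$:
$$s_i = \sigma\big(\mathbf{W}_s \tilde{\mathbf{h}}_i + b_s\big), \tag{6}$$
where $\mathbf{W}_s$ and $b_s$ are a trainable projection and bias, and $\sigma$ is the sigmoid activation function. The salience scores serve as the salience weights of review representations for the later review clustering and ranking component.
In order to optimize the sentence-level salience estimation component, we manually assign a binary salience label $l_i \in \{0, 1\}$ to each review $x_i$, where 1 denotes “item-related” whereas 0 denotes “noisy”. Then, the sentence-level salience estimation component is trained by minimizing the cross-entropy loss function:
$$\mathcal{L}_{s} = -\sum_{i=1}^{N} \big[\, l_i \log s_i + (1 - l_i) \log (1 - s_i) \,\big]. \tag{7}$$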
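The salience estimation steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions, not the trained component: the review vectors are taken as given (the BiGRU is omitted), queries, keys, and values are the review vectors themselves (identity projections), the feed-forward network is skipped, and the loss is averaged over reviews. The names `salience_scores` and `bce_loss` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def salience_scores(review_vecs, w, b):
    """Self-attention over reviews, residual connection, sigmoid score.

    Assumption: identity query/key/value projections and no feed-forward layer.
    """
    scores = []
    for h_i in review_vecs:
        attn = softmax([dot(h_i, h_j) for h_j in review_vecs])
        # Context vector: attention-weighted sum of all review vectors.
        context = [sum(a * h_j[d] for a, h_j in zip(attn, review_vecs))
                   for d in range(len(h_i))]
        enhanced = [x + c for x, c in zip(h_i, context)]  # residual connection
        scores.append(1.0 / (1.0 + math.exp(-(dot(w, enhanced) + b))))
    return scores


def bce_loss(scores, labels, eps=1e-12):
    """Binary cross-entropy between salience scores and 0/1 labels (mean over reviews)."""
    total = sum(l * math.log(s + eps) + (1 - l) * math.log(1 - s + eps)
                for s, l in zip(scores, labels))
    return -total / len(scores)


if __name__ == "__main__":
    vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    s = salience_scores(vecs, w=[2.0, -2.0], b=0.0)
    print(s, bce_loss(s, [1, 1, 0]))
```

In this toy setting the two similar reviews reinforce each other through attention, which is the intuition behind using shared information across reviews to separate item-related sentences from noise.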
4.3. B: Review Clustering and Ranking
We propose a review clustering and ranking component to learn the ranks of reviews by grouping reviews into ranked opinion clusters, which is the main prerequisite to accurately generate ranked opinion tags.
We use a standard transformer encoder (Vaswani et al., 2017) to convert each review $x_i$ into vector representations. Following Vaswani et al. (2017), we first map each word $w_{i,t}$ into its vectorized representation $\mathbf{x}_{i,t}^{0}$ using a word embedding layer $\mathrm{WE}$ and a positional embedding layer $\mathrm{PE}$, as shown in the following equation:
$$\mathbf{x}_{i,t}^{0} = \mathrm{WE}(w_{i,t}) + \mathrm{PE}(t). \tag{8}$$
Then we use transformer layers to encode global contextual information for the words within $x_i$:
$$\mathbf{u}_{i}^{l} = \mathrm{LayerNorm}\big(\mathbf{x}_{i}^{l-1} + \mathrm{MHAtt}(\mathbf{x}_{i}^{l-1})\big), \tag{9}$$
$$\mathbf{x}_{i}^{l} = \mathrm{LayerNorm}\big(\mathbf{u}_{i}^{l} + \mathrm{FFN}(\mathbf{u}_{i}^{l})\big), \tag{10}$$
where LayerNorm is the layer normalization proposed by Ba et al. (2016); MHAtt is the multi-head attention mechanism introduced by Vaswani et al. (2017); FFN is a two-layer feed-forward network with ReLU as the hidden activation function; $l \in \{1, \ldots, L\}$; and $L$ is the number of transformer block layers. The word-level vector representations of review $x_i$ are $(\mathbf{x}_{i,1}^{L}, \ldots, \mathbf{x}_{i,|x_i|}^{L})$. To obtain the sentence representation $\mathbf{z}_i$ of review $x_i$, we perform a hierarchical pooling operation (Shen et al., 2018) across its words. The hierarchical pooling mechanism can preserve word order information and has demonstrated superior performance over mean-pooling or max-pooling on many semantic analysis tasks (Shen et al., 2018).
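To make the hierarchical pooling step concrete, the sketch below follows the general recipe of Shen et al. (2018): average the word vectors inside each sliding window of consecutive words, then take an element-wise maximum over the window averages. The window size of 3 and the function name `hierarchical_pool` are assumptions for illustration; the window size used in AOT-Net is not stated here.

```python
def hierarchical_pool(word_vecs, window=3):
    """Mean-pool over sliding windows of words, then max-pool over the windows.

    word_vecs: list of equal-length lists of floats, one per word, in order.
    """
    n = len(word_vecs)
    dim = len(word_vecs[0])
    if n <= window:
        windows = [word_vecs]  # a short sentence forms a single window
    else:
        windows = [word_vecs[i:i + window] for i in range(n - window + 1)]
    # Local average over each window keeps some word-order (n-gram) information.
    averaged = [[sum(v[d] for v in win) / len(win) for d in range(dim)]
                for win in windows]
    # Global max over the window averages gives the sentence vector.
    return [max(a[d] for a in averaged) for d in range(dim)]


if __name__ == "__main__":
    print(hierarchical_pool([[1, 0], [3, 0], [5, 6], [1, 0]], window=2))
```

Because the averages are taken over contiguous windows, swapping distant words changes the result, which plain mean-pooling or max-pooling would not detect.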
To highlight the item-aware reviews and ignore noisy reviews, the sentence representations of reviews are first weighted by the corresponding salience scores, i.e., $\hat{\mathbf{z}}_i = s_i \cdot \mathbf{z}_i$. Then we apply the $k$-means (MacQueen, 1967) algorithm on $\{\hat{\mathbf{z}}_1, \ldots, \hat{\mathbf{z}}_N\}$ to group the corresponding reviews $x_i$ into $P$ opinion clusters. (Footnote 3: $P$ is manually assigned according to the number of reviews, taking one of two preset values.) We rank opinion clusters from the largest (containing the highest number of reviews) to the smallest, denoted as $(C_1, \ldots, C_P)$. Within each cluster, we rank reviews from the nearest to the farthest, according to the distance between the review and the cluster center. Finally, we obtain a ranked list of reviews, represented as $(x_{1,1}, \ldots, x_{1,n_1}, \ldots, x_{P,1}, \ldots, x_{P,n_P})$, where $x_{p,q}$ is the $q$-th review in the $p$-th opinion cluster and $n_p$ is the number of reviews in the $p$-th opinion cluster.
We sequentially concatenate the vector representations of the reviews in the ranked list and derive the final word-level representations, i.e., $\mathbf{X} = (\mathbf{x}_{1,1}, \ldots, \mathbf{x}_{1,m_1}, \ldots, \mathbf{x}_{P,1}, \ldots, \mathbf{x}_{P,m_P})$. (Footnote 4: In this paper, we only focus on the ranks of opinion clusters. We regard reviews or words in the same opinion cluster as equally important.) Similarly, $\mathbf{x}_{p,q}$ is the $q$-th word in the $p$-th opinion cluster and $m_p$ is the number of words in the $p$-th opinion cluster. Next, $\mathbf{X}$ will serve as the memory bank for the later rank-aware opinion tagging component.
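The clustering and ranking procedure above can be illustrated with a small, dependency-free sketch: weight each sentence vector by its salience score, run k-means, order clusters by size, and order reviews inside each cluster by distance to the cluster center. The deterministic initialization from the first `k` points, the fixed iteration count, and the function names are assumptions made to keep the example reproducible; they are not taken from AOT-Net.

```python
import math


def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm. Assumption: centers start at the first k points."""
    centers = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = mean(members)
    return assign, centers


def rank_reviews(sentence_vecs, salience, k):
    """Return review indices grouped by cluster: largest cluster first,
    and within a cluster, nearest to the center first."""
    weighted = [[s * x for x in vec] for vec, s in zip(sentence_vecs, salience)]
    assign, centers = kmeans(weighted, k)
    clusters = {}
    for idx, c in enumerate(assign):
        clusters.setdefault(c, []).append(idx)
    ordered = sorted(clusters.items(), key=lambda kv: -len(kv[1]))
    return [sorted(members, key=lambda i: dist(weighted[i], centers[c]))
            for c, members in ordered]


if __name__ == "__main__":
    vecs = [[0, 0], [10, 10], [0, 3], [1, 0], [10, 11]]
    print(rank_reviews(vecs, [1.0] * 5, k=2))
```

Flattening the returned nested list gives the ranked review order whose word representations are concatenated into the memory bank.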
4.4. C: Rank-aware Opinion Tagging
Accurately generating opinion tags with ranks is challenging. Therefore, we propose a rank-aware opinion tagging component to generate ranked opinion tags.
In the training stage, we add a start token BOS at the beginning of each opinion tag, i.e., $y_j \leftarrow \mathrm{BOS} \oplus y_j$, where $\oplus$ is the concatenation function. Then we concatenate all opinion tags into a sequence of words: $(y_{1,0}, y_{1,1}, \ldots, y_{1,|y_1|}, \ldots, y_{M,0}, y_{M,1}, \ldots, y_{M,|y_M|})$, where $y_{j,k}$ is the $k$-th word of opinion tag $y_j$ and $y_{j,0}$ is the BOS token of the $j$-th opinion tag. Our decoder, i.e., the rank-aware opinion tagging component, follows the transformer architecture (Vaswani et al., 2017).
From the data analysis in Figure 3, we know that the ranks of opinion tags have a strong correlation with the ranks of opinion clusters. The $j$-th opinion tag may pay attention to the $j$-th opinion cluster and its surrounding neighbors simultaneously. Therefore, for each opinion tag $y_j$, we hypothesize that the model needs to focus on the $K$ most related opinion clusters. (Footnote 5: $K$ is a hyperparameter. The $K$ opinion clusters are the $j$-th opinion cluster and its surrounding neighbors.) If an opinion cluster belongs to the focused opinion clusters, we call it a Focused Opinion Cluster (FOC). Otherwise, we call it an Outer Opinion Cluster (OOC). We design two alignment strategies between FOCs and opinion tags, i.e., incorporating alignment features and enforcing alignment loss, which help to improve the generation of ranked opinion tags.
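As a small illustration of the FOC/OOC split, the helper below returns the cluster ranks that would count as focused for a tag at a given rank. It assumes a symmetric window centered on the tag's rank and truncated at the boundaries; how neighbors are chosen at the first and last ranks is not specified in the text, so this boundary handling and the name `focused_clusters` are hypothetical.

```python
def focused_clusters(tag_rank, num_clusters, k=3):
    """1-based ranks of the Focused Opinion Clusters for the tag at `tag_rank`.

    Assumption: a window of size k centered on tag_rank, clipped to [1, num_clusters].
    Every cluster rank not returned is treated as an Outer Opinion Cluster.
    """
    half = k // 2
    lo = max(1, tag_rank - half)
    hi = min(num_clusters, tag_rank + half)
    return list(range(lo, hi + 1))


if __name__ == "__main__":
    for rank in range(1, 6):
        print(rank, focused_clusters(rank, num_clusters=5, k=3))
```

With a window of one, each tag attends only to the cluster at its own rank; larger windows admit more neighbors at the cost of more potentially irrelevant reviews.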
Alignment Feature
Intuitively, the opinion tags and their FOCs are semantically similar in the vector space. To help the model capture the alignment between opinion tags and their FOCs, we incorporate alignment features into the word-level representations of the opinion tags and opinion clusters. Formally, is a function that maps an integer into a vector. Now, we explain how we will use it to represent the ranks of opinion tags and opinion clusters. First, we incorporate the alignment feature into the representations of opinion tags. The words in the -th opinion tag have the same rank where . For each word , the vectorized representation is the sum of the alignment feature , the word embedding , and the positional embedding :
(11) |
where is a trainable model parameter.

For the target -th opinion tag, the ranks of FOCs are set to as well, while the ranks of OOCs are set to . Then we enhance the word vectors in X with alignment features to capture the alignment between reviews and opinion tags:
(12) |
Next, the alignment features are added into the review representations to obtain the alignment-enhanced representations for the words in opinion clusters:
(13) |
where is a model parameter. Figure 5 summarizes the construction.
At each decoding step of the -th opinion tag, the decoder reads the embeddings of the last prediction .
(14) |
where is the target word representation. Next, we introduce our decoder in detail.
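To capture semantic and alignment information from the opinion clusters, a multi-head cross-attention MHAtt (Vaswani et al., 2017) is applied to compute the attention score between the last prediction $\mathbf{y}_{t-1}$ and the memory bank $\mathbf{X}$:
$$\alpha_{t,n}^{h} = \frac{\exp\big(\mathbf{q}_{t}^{h\top} \mathbf{k}_{n}^{h} / \sqrt{d_h}\big)}{\sum_{n'} \exp\big(\mathbf{q}_{t}^{h\top} \mathbf{k}_{n'}^{h} / \sqrt{d_h}\big)}, \tag{15}$$
$$\mathbf{q}_{t}^{h} = \mathbf{W}_{q}^{h}\, \mathbf{y}_{t-1}, \tag{16}$$
$$\mathbf{k}_{n}^{h} = \mathbf{W}_{k}^{h}\, \mathbf{x}_{n}, \tag{17}$$
where $\mathbf{q}_{t}^{h}$ and $\mathbf{k}_{n}^{h}$ are query and key vectors that are linearly transformed from $\mathbf{y}_{t-1}$ and the $n$-th word representation $\mathbf{x}_n$ in $\mathbf{X}$, as in Vaswani et al. (2017); $h$ indicates the $h$-th head among $H$ heads; and $d_h$ is the dimension of each head.
The attention scores are then used to compute an aggregated vector $\mathbf{c}_t$ for the target word $y_t$, where $\mathbf{v}_{n}^{h}$ is the value vector linearly transformed from $\mathbf{x}_n$ and the outputs of the $H$ heads are concatenated:
$$\mathbf{c}_{t} = \Big[\sum_{n} \alpha_{t,n}^{1} \mathbf{v}_{n}^{1}; \ldots; \sum_{n} \alpha_{t,n}^{H} \mathbf{v}_{n}^{H}\Big]. \tag{18}$$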
Then we feed the last word representation and vector to a two-layer feed-forward network with a ReLU as the activation function and a highway layer normalization on top:
(19) | ||||
(20) |
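where $\mathbf{s}_t$ is the target word representation. After that, we use $\mathbf{s}_t$ to compute a probability distribution over the words in a predefined vocabulary $V$, as shown in the following equation:
$$P_{\mathrm{vocab}}(y_t) = \mathrm{softmax}\big(\mathbf{W}_{v} \mathbf{s}_t + \mathbf{b}_{v}\big), \tag{21}$$
where $\mathbf{W}_{v}$ and $\mathbf{b}_{v}$ are trainable parameters. To enable our model to generate out-of-vocabulary (OOV) words, we adopt the copy mechanism (See et al., 2017) to predict OOV words by directly copying words from the opinion clusters. We first compute a soft gate $g_t$ between generating a word from the predefined vocabulary $V$ and copying a word from the input reviews:
$$g_t = \sigma\big(\mathbf{W}_{g} \mathbf{s}_t + b_{g}\big), \tag{22}$$
where $\mathbf{W}_{g}$ and $b_{g}$ are trainable parameters. Finally, we can derive the final next-word probability distribution $P(y_t)$:
$$P(y_t) = g_t\, P_{\mathrm{vocab}}(y_t) + (1 - g_t)\, P_{\mathrm{copy}}(y_t), \tag{23}$$
$$P_{\mathrm{copy}}(y_t) = \sum_{n:\, x_n = y_t} \alpha_{t,n}, \tag{24}$$
where we use $P(y_t)$ to denote $P(y_t \mid y_{<t}, \mathbf{X})$ for brevity and $\alpha_{t,n}$ is the attention weight on the $n$-th input word. We use the negative log-likelihood of the ground-truth words as the generation loss function:
$$\mathcal{L}_{g} = -\sum_{t} \log P(y_t). \tag{25}$$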
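The way a copy mechanism in the style of See et al. (2017) blends generation and copying can be shown with a toy example. The sketch below mixes a vocabulary distribution with the attention weights over source tokens using a scalar gate; representing distributions as dictionaries and the name `final_distribution` are illustrative assumptions, and the gate value is supplied by hand rather than predicted.

```python
def final_distribution(p_vocab, attention, source_tokens, gate):
    """Blend generation and copy probabilities.

    p_vocab:       dict word -> probability over the fixed vocabulary
    attention:     attention weights over source positions (sums to 1)
    source_tokens: the source words at those positions
    gate:          probability of generating rather than copying
    """
    p_final = {w: gate * p for w, p in p_vocab.items()}
    for token, a in zip(source_tokens, attention):
        # Words outside the vocabulary can only receive probability by copying.
        p_final[token] = p_final.get(token, 0.0) + (1.0 - gate) * a
    return p_final


if __name__ == "__main__":
    p = final_distribution(
        p_vocab={"good": 0.7, "service": 0.3},
        attention=[0.5, 0.3, 0.2],
        source_tokens=["hot", "pot", "service"],
        gate=0.6,
    )
    print(p)
```

Here "hot" and "pot" are out-of-vocabulary yet still receive probability mass, while "service" accumulates mass from both the vocabulary and the copy path.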
Alignment Loss
During the generation of the -th opinion tag, we enforce an alignment loss to help locate FOCs accurately. The model is explicitly taught to focus on FOCs and ignore OOCs in the attention via the following loss:
(26) |
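One simple way to realize "focus on FOCs and ignore OOCs" is to penalize the attention mass that falls outside the focused clusters. The sketch below uses the negative log of the attention mass on FOC words; this particular form is an assumption chosen for illustration and is not claimed to match Eq. (26). The name `alignment_loss` and the per-word cluster labels are likewise hypothetical.

```python
import math


def alignment_loss(attention, cluster_ids, focused, eps=1e-12):
    """Negative log of the attention mass assigned to focused clusters.

    attention:   attention weights over source words (sums to 1)
    cluster_ids: cluster rank of each source word
    focused:     set of cluster ranks treated as FOCs for the current tag
    Assumed form for illustration only.
    """
    foc_mass = sum(a for a, c in zip(attention, cluster_ids) if c in focused)
    return -math.log(foc_mass + eps)


if __name__ == "__main__":
    attn = [0.1, 0.6, 0.2, 0.1]
    ids = [1, 2, 3, 4]
    print(alignment_loss(attn, ids, {1, 2, 3}))  # most mass on FOCs: small loss
    print(alignment_loss(attn, ids, {4}))        # most mass on OOCs: large loss
```

The loss approaches zero as all attention concentrates on the focused clusters and grows as attention leaks to outer clusters.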
4.5. Multi-task Training Objective
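We adopt a multi-task learning framework to jointly minimize the salience classification loss $\mathcal{L}_{s}$, the alignment loss $\mathcal{L}_{a}$, and the generation loss $\mathcal{L}_{g}$. The objective function is:
$$\mathcal{L} = \lambda_1 \mathcal{L}_{s} + \lambda_2 \mathcal{L}_{a} + \lambda_3 \mathcal{L}_{g}, \tag{27}$$
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are hyper-parameters that control the weights of these three losses. We set . Thus, each component of our joint model can be trained end-to-end.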
5. Experimental Setup
We set up experiments to compare AOT-Net against a number of relevant baselines. We are interested in the overall performance of AOT-Net and in understanding the effectiveness of the salience estimation and ranking alignment.
5.1. Experiments
We report on five experiments. First, we compare AOT-Net against a number of baselines to assess its overall performance. Second, we conduct ablation studies to analyze the influence of different components in AOT-Net as follows: (i) w/o SSE is AOT-Net without the sentence-level salience estimation component (SSE); (ii) w/o RCR is AOT-Net without the review clustering and ranking component (RCR); (iii) w/o AF is AOT-Net without the alignment feature (AF); and (iv) w/o AL is AOT-Net without the alignment loss (AL). Third, to further explore the effectiveness of the sentence-level self-attention mechanism in SSE, we consider AOT-RNN, a variant that uses only the BiGRU for salience score prediction, and AOT-Embed, a variant that replaces the BiGRU in SSE with an MLP. Fourth, we analyze the performance of AOT-Net for different numbers of FOCs. Lastly, we provide a case study about abstractive opinion tagging.
5.2. Baselines
We compare AOT-Net with the following methods: (i) TF-IDF is an extractive approach that selects important words as the summary based on term frequency and inverse document frequency; (ii) TextRank (Mihalcea and Tarau, 2004) is an unsupervised algorithm based on weighted graphs; (iii) RNN is a sequence-to-sequence model with attention, implemented with a bi-directional GRU layer (Cho et al., 2014); (iv) PG-Net (See et al., 2017) is a classical summarization model based on the encoder-decoder framework with attention and copy mechanisms; and (v) Transformer (Vaswani et al., 2017) is a Transformer-based encoder-decoder model with a copy mechanism, which is a strong baseline widely adopted in opinion summarization.
5.3. The eComTag Dataset
Since there is no available opinion tagging dataset, we build a new one, named eComTag, from several Chinese e-commerce websites. We collect nearly 112k items, sampled from different domains, including Cosmetic (), Electronics (), Books (), Entertainment (), Food (), Sports (), Clothes (), Medical (), and Furniture (). For each item, there is a set of reviews and a list of opinion tags. Since reviewers may comment on multiple aspects, e.g., “The dim sum tasted extremely fresh, and the price was quite reasonable!”, we split each review into sentences by punctuation. We use a sentence to denote a review in our paper. Then, we remove samples where the number of opinion tags is smaller than 4 or the number of reviews is smaller than 50. Finally, we construct eComTag with 50,068 item samples. Users may write reviews arbitrarily, which results in many meaningless expressions, such as “Love, love, LOVE this space!” and “come with my boyfriend.” To teach our model to identify these noisy sentences, we annotate each review with a binary salience label via human judgment. If a review is item-related, we label the review as 1, otherwise 0. The salience labels are the supervision signals for the sentence-level salience estimation component.
Finally, each item sample consists of a set of reviews, a set of corresponding salience labels, and a sequence of opinion tags. For text preprocessing, we tokenize texts using the Jieba toolkit (https://github.com/fxsjy/jieba) and maintain a 50k vocabulary. In eComTag, about 30% of samples have more than words in reviews. We randomly split the dataset into training/validation/test sets with an 8:1:1 ratio. The statistics are shown in Table 1. Specifically, we define a present tag (Pr) as a tag that appears verbatim in the reviews and an absent tag (Ab) as a tag that does not appear in the reviews. The proportion of absent tags is close to 75%, which further supports the need for abstractive methods in opinion tagging.
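The present/absent distinction can be computed with a simple string check. The sketch below counts a tag as present if it occurs verbatim as a substring of at least one review of the same item; matching on raw strings rather than on tokenized text is an assumption, and `absent_ratio` is a hypothetical helper name.

```python
def absent_ratio(samples):
    """Fraction of tags that never occur verbatim in their item's reviews.

    samples: list of (reviews, tags) pairs, each a list of strings.
    Assumption: a tag is 'present' if it is a substring of any review.
    """
    absent = 0
    total = 0
    for reviews, tags in samples:
        for tag in tags:
            total += 1
            if not any(tag in review for review in reviews):
                absent += 1
    return absent / total if total else 0.0


if __name__ == "__main__":
    data = [(
        ["the food is delicious food indeed", "quick service"],
        ["delicious food", "value for money"],
    )]
    print(absent_ratio(data))
```

A high absent ratio means that most target tags cannot be produced by extraction alone, which is the argument made above for abstractive generation.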
5.4. Evaluation Metrics
We employ two information retrieval metrics to evaluate the opinion tag generation: the macro F score and the normalized discounted cumulative gain (NDCG) score. Both are widely used to measure word overlap (Sun et al., 2019; Chan et al., 2020). To measure the diversity of the generated opinion tags, we adopt the Distinct-N score (Li et al., 2016) and a macro Unique-N score. We compute Unique-N as $\frac{1}{S}\sum_{s=1}^{S} u_s$, where $u_s$ is the number of distinct opinion tags in the $s$-th sample and $S$ is the number of samples. Moreover, we design two metrics, Exact Rank Match (ERM) and Fuzzy Rank Match (FRM), to evaluate the rank accuracy of opinion tags. ERM is defined as the one-to-one exact match proportion between true tags and predicted tags. Inspired by the Embedding Score (Liu et al., 2016), FRM first maps predicted tags and the corresponding true tags into the same vector space, and then computes the average cosine similarity between their vector representations.
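The rank and diversity metrics described above can be sketched as follows. The implementations are illustrative readings of the prose definitions rather than the evaluation scripts used in the experiments: ERM is assumed to compare tags position by position and normalize by the number of true tags, Distinct-n is the ratio of distinct to total n-grams (Li et al., 2016), and the macro Unique score is the average number of distinct tags per sample. All function names are hypothetical.

```python
def exact_rank_match(pred_tags, true_tags):
    """Share of rank positions where the predicted tag equals the true tag.

    Assumption: positions are compared one-to-one and the count is
    normalized by the number of true tags.
    """
    hits = sum(1 for p, t in zip(pred_tags, true_tags) if p == t)
    return hits / len(true_tags) if true_tags else 0.0


def distinct_n(token_lists, n=2):
    """Number of distinct n-grams divided by the total number of n-grams."""
    ngrams = [tuple(toks[i:i + n])
              for toks in token_lists
              for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def macro_unique(samples):
    """Average number of distinct tags per sample."""
    return sum(len(set(tags)) for tags in samples) / len(samples)


if __name__ == "__main__":
    print(exact_rank_match(["a", "b", "c"], ["a", "c", "b"]))
    print(distinct_n([["good", "service"], ["good", "service", "fast"]]))
    print(macro_unique([["a", "b", "a"], ["c"]]))
```

Note that ERM under this reading gives no credit for a correct tag at the wrong rank, which is what distinguishes it from set-based overlap metrics such as the F score.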
Data | Sample | Pr | Ab | MTN | MTL |
Training | 40,162 | 24.1% | 75.9% | 19 | 40 |
Validation | 4,953 | 24.3% | 75.7% | 20 | 39 |
Test | 4,953 | 24.1% | 75.9% | 14 | 32 |
5.5. Implementation Details
We adopt the Adam (Kingma and Ba, 2015) optimizer with settings {,,,} and we vary the learning rate following Vaswani et al. (2017). We add dropout (Srivastava et al., 2014) with a keep rate of 0.8 and label smoothing (Szegedy et al., 2016) with a smoothing factor of 0.1. We use the Tencent AI Lab Chinese Embeddings (https://ai.tencent.com/ailab/nlp/embedding.html) to initialize the word embedding layers. The rest of the parameters are randomly initialized. The dimensions of the alignment feature, word embedding layers, and positional embedding layers are set to 200. We set the batch size to 16 and use the validation loss for early stopping. At inference time, we set the maximum number of decoding steps to 50. We use a bidirectional GRU (Cho et al., 2014) with 2 layers to implement the sentence-level salience estimation component. All RNN-based models have 256 hidden units. All transformer-based models have 300 hidden units; the feed-forward hidden size is set to 50 for all layers. We set $K$ in Section 4.4 to 3 ($K$ is the number of Focused Opinion Clusters at each decoding step) to ensure that the target tag token has enough relevant reviews to reference while avoiding too much interfering information. AOT-Net is implemented in PyTorch and was trained on a single Tesla V100 GPU. All hyperparameters and models are selected on the validation set and the results are reported on the test set.
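For reference, the learning rate schedule of Vaswani et al. (2017) increases the rate linearly during warmup and then decays it with the inverse square root of the step number. The sketch below implements that formula; the model dimension of 300 matches the transformer hidden size mentioned above, while the warmup length of 4000 steps is the default from Vaswani et al. and is only an assumption here.

```python
def noam_lr(step, d_model=300, warmup=4000):
    """Learning rate schedule of Vaswani et al. (2017).

    lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    Assumption: warmup=4000 is the original default, not a value from this paper.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


if __name__ == "__main__":
    for s in (1, 1000, 4000, 16000):
        print(s, noam_lr(s))
```

The rate peaks at the end of warmup and decreases afterwards, so the warmup length effectively sets both the peak value and when decay begins.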
Models | F | F | NDCG@5 | NDCG@10 | ERM | FRM | Distinct-2 (Micro) | Distinct-2 (Macro) | Unique-N |
TF-IDF | 0.0039 | 0.0038 | 0.0168 | 0.0169 | 0.16 | 0.19 | – | – | – |
TextRank | 0.0019 | 0.0018 | 0.0091 | 0.0097 | 0.06 | 0.21 | – | – | – |
RNN | 0.2895 | 0.2753 | 0.7383 | 0.7701 | 0.17 | 0.44 | 0.60 | 62.19 | 7.447 |
PG-Net | 0.3138 | 0.2896 | 0.7600 | 0.8009 | 0.19 | 0.44 | 1.33 | 65.78 | 6.798 |
Transformer | 0.2833 | 0.2756 | 0.6916 | 0.7483 | 0.25 | 0.59 | 1.57 | 89.23 | 8.851 |
AOT-Net | 0.3529 | 0.3492 | 0.7473 | 0.8045 | 0.31 | 0.64 | 1.30 | 94.35 | 8.953 |
w/o SSE | 0.2930 | 0.2822 | 0.7022 | 0.7563 | 0.25 | 0.61 | 1.11 | 91.85 | 8.957 |
w/o RCR | 0.3434 | 0.3370 | 0.7353 | 0.7913 | 0.31 | 0.63 | 1.77 | 92.94 | 8.935 |
w/o AF | 0.3141 | 0.3056 | 0.7194 | 0.7768 | 0.28 | 0.62 | 1.21 | 93.23 | 8.997 |
w/o AL | 0.3406 | 0.3336 | 0.7322 | 0.7857 | 0.30 | 0.64 | 1.30 | 93.11 | 8.926 |
6. Results and Analysis
Overall evaluation results on generating opinion tags are listed in Table 2. We find that all the abstractive models significantly outperform the traditional extractive baselines. We conclude that the informal and colloquial nature of user-generated reviews makes item-related features hard to identify with unsupervised extraction methods. As expected, we also find that PG-Net outperforms RNN on most metrics, which implies that the copy mechanism is useful for opinion summarization. AOT-Net outperforms the Transformer baseline on all metrics except Micro Distinct-2 and achieves the best performance on most metrics. The results of our ablation studies are shown in the lower part of Table 2. We observe that after removing the SSE component, the performance of AOT-Net drops noticeably on most metrics. If we do not rank and group reviews into opinion clusters before decoding (i.e., w/o RCR), the Micro Distinct-2 diversity metric increases slightly, but the rank accuracy of AOT-Net decreases, as we anticipated. We also find that after removing the alignment feature or the alignment loss in the decoder, both the retrieval and the rank accuracy metrics (i.e., F and ERM) degrade. We analyze the individual components in detail in the following sections.
6.1. Salience Estimation Analysis
As shown in Table 2, AOT-Net improves over “w/o SSE” in terms of both Micro-Distinct-2 and Macro-Distinct-2. Similar improvements can be observed for other metrics. This demonstrates that the predicted salience scores help AOT-Net to focus on more valuable reviews. To verify the effectiveness of SSE in more detail, Figure 6 reports the accuracy of AOT-Embed, AOT-RNN, and AOT-Net on sentence-level salience estimation. AOT-Net outperforms both AOT-Embed and AOT-RNN, which verifies the effectiveness of the BiGRU and self-attention mechanisms. AOT-RNN is more accurate than AOT-Embed on all reviews, which verifies the advantage of the BiGRU in representing reviews. AOT-Net is in turn more accurate than AOT-RNN, both on all reviews and on item-related reviews. This indicates that AOT-Net benefits from the self-attention mechanism, which captures shared information that helps distinguish item-related reviews.
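The diversity metrics discussed above can be made concrete with a short sketch. Distinct-2 (Li et al., 2016) is commonly computed as the ratio of unique bigrams to all bigrams in the generated text. How the micro and macro variants aggregate over items is not restated in this section, so the sketch assumes that the macro variant averages per-item scores and the micro variant pools the tags of all items before scoring. It also assumes that bigrams do not cross tag boundaries. The function names and the toy tags are illustrative and are not taken from the released code.

```python
from typing import List, Sequence, Tuple

Tag = Sequence[str]  # one opinion tag as a list of tokens


def bigrams(tokens: Tag) -> List[Tuple[str, str]]:
    """Return the consecutive token pairs inside a single tag."""
    return list(zip(tokens, tokens[1:]))


def distinct_2(tags: Sequence[Tag]) -> float:
    """Unique bigrams divided by total bigrams over a collection of tags."""
    all_bigrams = [bg for tag in tags for bg in bigrams(tag)]
    if not all_bigrams:
        return 0.0
    return len(set(all_bigrams)) / len(all_bigrams)


def macro_distinct_2(items: Sequence[Sequence[Tag]]) -> float:
    """Assumed macro variant: score each item's tag list, then average."""
    scores = [distinct_2(tags) for tags in items]
    return sum(scores) / len(scores) if scores else 0.0


def micro_distinct_2(items: Sequence[Sequence[Tag]]) -> float:
    """Assumed micro variant: pool the tags of all items, then score once."""
    pooled = [tag for tags in items for tag in tags]
    return distinct_2(pooled)


# Toy data: two items, each with two generated tags (hypothetical).
items = [
    [["good", "service"], ["tasty", "food"]],
    [["good", "service"], ["nice", "view"]],
]
macro = macro_distinct_2(items)
micro = micro_distinct_2(items)
```

Under these assumptions, tags repeated across items lower the micro score but not the macro score, which would be consistent with the much smaller micro values in Table 2.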

6.2. Number of FOCs
To evaluate the effect of the number of FOCs on the performance of rank-aware opinion tagging, we examine the performance of AOT-Net with different numbers of FOCs (see Section 4.4) in terms of ERM, FRM, and Distinct-2. As shown in Table 3, using a single FOC significantly decreases both the rank and the diversity performance of the model. This suggests that focusing on only a single opinion cluster ignores many related reviews. We also find that with five FOCs the performance of AOT-Net decreases slightly; AOT-Net achieves the best performance on all metrics with three FOCs. Hence, we infer that three FOCs offer a trade-off between focusing on relevant reviews and removing irrelevant noise.
Number of FOCs | ERM | FRM | Macro Distinct-2 |
1 | 0.30 | 0.62 | 93.24 |
3 | 0.31 | 0.64 | 94.35 |
5 | 0.30 | 0.63 | 93.32 |
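The role of the number of FOCs can be illustrated with a minimal sketch of the two-step ranking described for the review clustering and ranking component: clusters are ordered by size, and reviews inside each cluster are ordered by their distance to the cluster center. The sketch assumes that cluster labels and centers have already been produced by some clustering step, uses Euclidean distance, and keeps only the largest `num_focs` clusters. The function name, the argument names, and the toy vectors are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def rank_reviews(
    vectors: Sequence[Sequence[float]],
    labels: Sequence[int],
    centers: Dict[int, Sequence[float]],
    num_focs: int,
) -> List[List[int]]:
    """Return review indices for the `num_focs` largest clusters.

    Clusters are ordered by size (largest first). Inside each cluster,
    reviews are ordered by Euclidean distance to the cluster center.
    """
    members: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    # Largest clusters first; ties are broken by cluster id for determinism.
    ordered = sorted(members, key=lambda c: (-len(members[c]), c))

    ranked: List[List[int]] = []
    for cluster in ordered[:num_focs]:
        center = centers[cluster]
        ranked.append(
            sorted(members[cluster], key=lambda i: math.dist(vectors[i], center))
        )
    return ranked


# Toy review embeddings with precomputed cluster assignments (hypothetical).
vectors = [[0.0, 0.0], [0.5, 0.0], [1.0, 1.0], [0.1, 0.0], [5.0, 5.0], [1.4, 1.0]]
labels = [0, 0, 1, 0, 2, 1]
centers = {0: [0.1, 0.0], 1: [1.1, 1.0], 2: [5.0, 5.0]}

one_foc = rank_reviews(vectors, labels, centers, num_focs=1)
two_focs = rank_reviews(vectors, labels, centers, num_focs=2)
```

With `num_focs=1` only the largest cluster survives, which mirrors the observation that a single FOC ignores many related reviews; a larger value admits smaller and potentially noisier clusters.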

Figure 7: An example of cross-attention transit.
6.3. Case Study
Figure 7 shows an example illustrating the 5 highest attention weights during the generation of opinion tags. We see that the model shifts its focus smoothly from the first review sentence to later review sentences. Sometimes, the model may focus on the same review for two opinion tags, such as “The service (1st tag) is good and hairstyle (2nd tag) is good.” The first three predicted tags are completely accurate, whereas the Transformer predicts only the first tag accurately. To validate the effectiveness of the alignment loss, we calculate and for all items in the test set. Results show that and in AOT-Net are and on average. However, for AOT-Net w/o AL, and are and , respectively. This indicates that the alignment feature by itself is not enough to force AOT-Net to focus on FOCs. In particular, we compare the ranks of the opinion tags generated by the Transformer and by AOT-Net. For the Transformer, items whose first 3 opinion tags have accurate ranks account for only about 6.11%, while for AOT-Net the proportion reaches 9.68%. Thus, we conclude that the alignment feature and alignment loss in AOT-Net help to capture the ranks of the opinion tags.
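The proportion of items whose first 3 opinion tags have accurate ranks can be computed along the following lines. This is a sketch under an assumption: "accurate ranks" is read here as an exact, position-by-position match between the first n predicted tags and the first n reference tags, and the criterion actually used for the reported 6.11% and 9.68% may differ. The function name and the toy tags are hypothetical.

```python
from typing import Sequence


def top_n_rank_accuracy(
    predictions: Sequence[Sequence[str]],
    references: Sequence[Sequence[str]],
    n: int = 3,
) -> float:
    """Fraction of items whose first n predicted tags equal the first n
    reference tags at the same positions (assumed matching criterion)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = 0
    for pred, ref in zip(predictions, references):
        # Items with fewer than n reference tags cannot count as a hit.
        if len(ref) >= n and list(pred[:n]) == list(ref[:n]):
            hits += 1
    return hits / len(predictions)


# Toy ranked tag lists for two items (hypothetical).
predictions = [
    ["good service", "nice haircut", "fair price", "clean"],
    ["tasty food", "fair price", "fast service"],
]
references = [
    ["good service", "nice haircut", "fair price", "quiet"],
    ["tasty food", "fast service", "fair price"],
]
top3 = top_n_rank_accuracy(predictions, references, n=3)
top1 = top_n_rank_accuracy(predictions, references, n=1)
```

In the toy data the second item contains the right tags in the wrong order, so it counts for `n=1` but not for `n=3`; this is the kind of ranking error the alignment feature and alignment loss are meant to reduce.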
7. Conclusion and Future Work
In this paper, we have proposed the abstractive opinion tagging task, which aims to automatically generate a ranked list of opinion tags from a large number of reviews. We have proposed a rank-aware abstractive opinion tagging framework (AOT-Net) that includes a sentence-level salience estimation component, a review clustering and ranking component, and a rank-aware opinion tagging component. To validate the effectiveness of AOT-Net, we conducted extensive experiments on a newly collected real-world dataset, eComTag. Experiments show that AOT-Net achieves state-of-the-art performance on the abstractive opinion tagging task. AOT-Net has two main advantages over previous work. On the one hand, it generates more concise opinion tags; on the other hand, the ranked lists of generated opinion tags help users distinguish products with very similar aspects. Our work provides a plausible solution for greatly reducing human annotation costs in online e-commerce opinion tagging. Although we focused mostly on e-commerce portals, our methods are also broadly applicable to other settings with opinionated content, such as microblogs.
Limitations of our work include its low efficiency and coarse-grained salience estimation. In future work, we will adopt a deep clustering network instead of the k-means algorithm. Pre-trained language models could also strengthen our sentence salience estimation. Few-shot learning for handling unbalanced review distributions across different domains is another direction. It would also be interesting to explore user interactions with AOT-Net to generate personalized opinion tags.
Code and Data
The source code and dataset used in this paper are available at https://github.com/qtli/AOT.
Acknowledgements.
We thank our reviewers for valuable feedback. This work was supported by the National Key R&D Program of China with grant No. 2020YFB1406704, the Natural Science Foundation of China (61972234, 61902219, 61672324, 61672322, 62072279), the Key Scientific and Technological Innovation Program of Shandong Province (2019JZZY010129), the Tencent AI Lab Rhino-Bird Focused Research Program (JR201932), the Fundamental Research Funds of Shandong University, and the Foundation of State Key Laboratory of Cognitive Intelligence, iFLYTEK, P.R. China (COGOSC-20190003). All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.
References
- Amplayo and Lapata (2019) Reinald Kim Amplayo and Mirella Lapata. 2019. Informative and Controllable Opinion Summarization. CoRR abs/1909.02322 (2019).
- Amplayo and Lapata (2020) Reinald Kim Amplayo and Mirella Lapata. 2020. Unsupervised Opinion Summarization with Noising and Denoising. In ACL. 1934–1945.
- Angelidis and Lapata (2018) Stefanos Angelidis and Mirella Lapata. 2018. Summarizing Opinions: Aspect Extraction Meets Sentiment Prediction and They Are Both Weakly Supervised. In EMNLP. 3675–3686.
- Ba et al. (2016) Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. CoRR abs/1607.06450 (2016).
- Brazinskas et al. (2020) Arthur Brazinskas, Mirella Lapata, and Ivan Titov. 2020. Unsupervised Opinion Summarization as Copycat-Review Generation. In ACL. 5151–5169.
- Carenini et al. (2013) Giuseppe Carenini, Jackie Chi Kit Cheung, and Adam Pauls. 2013. Multi-Document Summarization of Evaluative Text. Comput. Intell. 29, 4 (2013), 545–576.
- Carenini et al. (2006) Giuseppe Carenini, Raymond T. Ng, and Adam Pauls. 2006. Multi-Document Summarization of Evaluative Text. In EACL. The Association for Computer Linguistics.
- Carmeli et al. (2020) Nofar Carmeli, Xiaolan Wang, Yoshihiko Suhara, Stefanos Angelidis, Yuliang Li, Jinfeng Li, and Wang-Chiew Tan. 2020. ExplainIt: Explainable Review Summarization with Opinion Causality Graphs. CoRR abs/2006.00119 (2020).
- Chan et al. (2020) Hou Pong Chan, Wang Chen, and Irwin King. 2020. A Unified Dual-view Model for Review Summarization and Sentiment Classification with Inconsistency Loss. In SIGIR. 1191–1200.
- Chan et al. (2019) Hou Pong Chan, Wang Chen, Lu Wang, and Irwin King. 2019. Neural Keyphrase Generation via Reinforcement Learning with Adaptive Rewards. In ACL. 2163–2174.
- Chen et al. (2019) Wang Chen, Yifan Gao, Jiani Zhang, Irwin King, and Michael R Lyu. 2019. Title-Guided Encoding for Keyphrase Generation. In AAAI, Vol. 33. 6268–6275.
- Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP. 1724–1734.
- Elad et al. (2019) Guy Elad, Ido Guy, Slava Novgorodov, Benny Kimelfeld, and Kira Radinsky. 2019. Learning to Generate Personalized Product Descriptions. In CIKM. 389–398.
- Fabbrizio et al. (2014) Giuseppe Di Fabbrizio, Amanda Stent, and Robert J. Gaizauskas. 2014. A Hybrid Approach to Multi-document Summarization of Opinions in Reviews. In INLG. 54–63.
- Ganesan et al. (2010) Kavita Ganesan, ChengXiang Zhai, and Jiawei Han. 2010. Opinosis: A Graph Based Approach to Abstractive Summarization of Highly Redundant Opinions. In COLING. 340–348.
- Gao et al. (2019) Shen Gao, Zhaochun Ren, Yihong Eric Zhao, Dongyan Zhao, Dawei Yin, and Rui Yan. 2019. Product-Aware Answer Generation in E-Commerce Question-Answering. In WSDM. 429–437.
- Gerani et al. (2014) Shima Gerani, Yashar Mehdad, Giuseppe Carenini, Raymond T. Ng, and Bita Nejat. 2014. Abstractive Summarization of Product Reviews Using Discourse Structure. In EMNLP. 1602–1613.
- Gollapalli et al. (2017) Sujatha Das Gollapalli, Xiao-Li Li, and Peng Yang. 2017. Incorporating Expert Knowledge into Keyphrase Extraction. In AAAI.
- Hu and Liu (2004) Minqing Hu and Bing Liu. 2004. Mining and Summarizing Customer Reviews. In SIGKDD. ACM, 168–177.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
- Le et al. (2016) Tho Thi Ngoc Le, Minh Le Nguyen, and Akira Shimazu. 2016. Unsupervised Keyphrase Extraction: Introducing New Kinds of Words to Keyphrases. In AI 2016 (LNCS), Vol. 9992. 665–671.
- Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A Diversity-Promoting Objective Function for Neural Conversation Models. In NAACL. 110–119.
- Li et al. (2019b) Junjie Li, Xuepeng Wang, Dawei Yin, and Chengqing Zong. 2019b. Attribute-aware Sequence Network for Review Summarization. In EMNLP-IJCNLP. 2998–3008.
- Li et al. (2020) Pengyuan Li, Lei Huang, and Guang-jie Ren. 2020. Topic Detection and Summarization of User Reviews. CoRR abs/2006.00148 (2020).
- Li et al. (2019a) Piji Li, Zihao Wang, Lidong Bing, and Wai Lam. 2019a. Persona-Aware Tips Generation. In WWW. 1006–1016.
- Li et al. (2017) Piji Li, Zihao Wang, Zhaochun Ren, Lidong Bing, and Wai Lam. 2017. Neural Rating Regression with Abstractive Tips Generation for Recommendation. In SIGIR. 345–354.
- Liu et al. (2016) Chia-Wei Liu, Ryan Lowe, Iulian Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. In EMNLP. 2122–2132.
- Liu et al. (2020a) Dayiheng Liu, Yeyun Gong, Jie Fu, Wei Liu, Yu Yan, Bo Shao, Daxin Jiang, Jiancheng Lv, and Nan Duan. 2020a. Diverse, Controllable, and Keyphrase-Aware: A Corpus and Method for News Multi-Headline Generation. arXiv preprint arXiv:2004.03875 (2020).
- Liu et al. (2020b) Rui Liu, Zheng Lin, and Weiping Wang. 2020b. Keyphrase Prediction With Pre-trained Language Model. arXiv preprint arXiv:2004.10462 (2020).
- Lu et al. (2009) Yue Lu, ChengXiang Zhai, and Neel Sundaresan. 2009. Rated Aspect Summarization of Short Comments. In WWW. 131–140.
- Luan et al. (2017) Yi Luan, Mari Ostendorf, and Hannaneh Hajishirzi. 2017. Scientific Information Extraction with Semi-supervised Neural Tagging. In EMNLP.
- MacQueen (1967) James MacQueen. 1967. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1. Oakland, CA, USA, 281–297.
- Medelyan et al. (2009) Olena Medelyan, Eibe Frank, and Ian H Witten. 2009. Human-competitive Tagging using Automatic Keyphrase Extraction. In EMNLP. 1318–1327.
- Meng et al. (2017) Rui Meng, Sanqiang Zhao, Shuguang Han, Daqing He, Peter Brusilovsky, and Yu Chi. 2017. Deep Keyphrase Generation. CoRR abs/1704.06879 (2017).
- Mihalcea and Tarau (2004) Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing Order into Text. In EMNLP. 404–411.
- Novgorodov et al. (2019) Slava Novgorodov, Ido Guy, Guy Elad, and Kira Radinsky. 2019. Generating Product Descriptions from User Reviews. In WWW. 1354–1364.
- Ray Chowdhury et al. (2019) Jishnu Ray Chowdhury, Cornelia Caragea, and Doina Caragea. 2019. Keyphrase Extraction from Disaster-related Tweets. In WWW. 1555–1566.
- See et al. (2017) Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In ACL. 1073–1083.
- Shen et al. (2018) Dinghan Shen, Guoyin Wang, Wenlin Wang, Martin Renqiang Min, Qinliang Su, Yizhe Zhang, Chunyuan Li, Ricardo Henao, and Lawrence Carin. 2018. Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms. In ACL. 440–450.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 15, 1 (2014), 1929–1958.
- Suhara et al. (2020) Yoshihiko Suhara, Xiaolan Wang, Stefanos Angelidis, and Wang-Chiew Tan. 2020. OpinionDigest: A Simple Framework for Opinion Summarization. In ACL. 5789–5798.
- Sun et al. (2019) Zhiqing Sun, Jian Tang, Pan Du, Zhi-Hong Deng, and Jian-Yun Nie. 2019. DivGraphPointer: A Graph Pointer Network for Extracting Diverse Keyphrases. In SIGIR. 755–764.
- Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In CVPR. 2818–2826.
- Tang et al. (2015) Duyu Tang, Bing Qin, Ting Liu, and Yuekui Yang. 2015. User Modeling with Neural Network for Review Rating Prediction. In IJCAI.
- Tay (2019) Wenyi Tay. 2019. Not All Reviews Are Equal: Towards Addressing Reviewer Biases for Opinion Summarization. In ACL. 34–42.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In NeurIPS. 6000–6010.
- Wang and Ling (2016) Lu Wang and Wang Ling. 2016. Neural Network-Based Abstract Generation for Opinions and Arguments. In NAACL. 47–57.
- Wang et al. (2016) Minmei Wang, Bo Zhao, and Yihua Huang. 2016. PTR: Phrase-based Topical Ranking for Automatic Keyphrase Extraction in Scientific Publications. In ICONIP. 120–128.
- Wang et al. (2019a) Yue Wang, Jing Li, Hou Pong Chan, Irwin King, Michael R. Lyu, and Shuming Shi. 2019a. Topic-Aware Neural Keyphrase Generation for Social Media Language. In ACL. 2516–2526.
- Wang et al. (2019b) Yue Wang, Jing Li, Irwin King, Michael R. Lyu, and Shuming Shi. 2019b. Microblog Hashtag Generation via Encoding Conversation Contexts. In NAACL-HLT. 1624–1633.
- Xiong and Litman (2014) Wenting Xiong and Diane J. Litman. 2014. Empirical Analysis of Exploiting Review Helpfulness for Extractive Summarization of Online Reviews. In COLING. 1985–1995.
- Zhang et al. (2016) Qi Zhang, Yang Wang, Yeyun Gong, and Xuanjing Huang. 2016. Keyphrase Extraction Using Deep Recurrent Neural Networks on Twitter. In EMNLP. 836–845.
- Zhang et al. (2018) Yingyi Zhang, Jing Li, Yan Song, and Chengzhi Zhang. 2018. Encoding Conversation Context for Neural Keyphrase Extraction from Microblog Posts. In NAACL-HLT. 1676–1686.