
Training Entire-Space Models for Target-oriented Opinion Words Extraction

Yuncong Li (Tencent Inc., Shenzhen, China, [email protected]), Fang Wang (Shenzhen University, Shenzhen, China, [email protected]), and Sheng-Hua Zhong (Shenzhen University, Shenzhen, China, [email protected])
(2022)
Abstract.

Target-oriented opinion words extraction (TOWE) is a subtask of aspect-based sentiment analysis (ABSA). Given a sentence and an aspect term occurring in the sentence, TOWE extracts the corresponding opinion words for the aspect term. TOWE has two types of instances. In the first type, aspect terms are associated with at least one opinion word, while in the second type, aspect terms do not have corresponding opinion words. However, previous studies trained and evaluated their models with only the first type of instance, resulting in a sample selection bias problem. Specifically, TOWE models were trained with only the first type of instance, while at inference time they are applied to the entire space, which contains both types of instance. Thus, their generalization performance is hurt. Moreover, the performance of these models on the first type of instance cannot reflect their performance on the entire space. To validate the sample selection bias problem, we extend four popular TOWE datasets, which contain only aspect terms associated with at least one opinion word, to additionally include aspect terms without corresponding opinion words. Experimental results on these datasets show that training TOWE models on the entire space significantly improves model performance and that evaluating TOWE models only on the first type of instance overestimates model performance. Data and code are available at https://github.com/l294265421/SIGIR22-TOWE.

target-oriented opinion words extraction, aspect-based sentiment analysis, sample selection bias
journalyear: 2022
copyright: acmcopyright
conference: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval; July 11–15, 2022; Madrid, Spain
booktitle: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22), July 11–15, 2022, Madrid, Spain
price: 15.00
isbn: 978-1-4503-8732-3/22/07
doi: 10.1145/XXXXXX.XXXXXX
ccs: Information systems → Sentiment analysis

1. Introduction

Aspect-based sentiment analysis (ABSA) (Hu and Liu, 2004; Pontiki et al., 2014, 2015, 2016) is a branch of sentiment analysis (Nasukawa and Yi, 2003; Liu, 2012). Target-oriented opinion words extraction (TOWE) (Fan et al., 2019) is a subtask of ABSA. Given a sentence and an aspect term occurring in the sentence, TOWE extracts the corresponding opinion words for the aspect term. For example, as shown in Figure 1, given the sentence “Try the rose roll (not on menu).” and the aspect term “rose roll” appearing in the sentence, TOWE extracts the opinion word “Try”.

Figure 1. Two examples of the TOWE task. The words highlighted in blue are the given aspect terms, and the words in orange are the corresponding opinion words.
Figure 2. Illustration of the sample selection bias problem in TOWE modeling. The training and evaluation spaces are composed of aspect terms associated with at least one opinion word (i.e., Type I instances); they are only part of the inference space, which is composed of all aspect terms. Note that inference happens in real-world scenarios, where TOWE models cannot restrict extraction to Type I instances, since they cannot know in advance whether an aspect term has opinion words.

TOWE has two types of instances. In the first type, called Type I instances, aspect terms are associated with at least one opinion word; an example is shown in Figure 1 (a). In the second type, called Type II instances, aspect terms do not have corresponding opinion words; an example is shown in Figure 1 (b). However, previous studies (Fan et al., 2019; Wu et al., 2020; Pouran Ben Veyseh et al., 2020; Feng et al., 2021; Mensah et al., 2021; Jiang et al., 2021; Kang et al., 2021) trained and evaluated their models only on Type I instances and ignored Type II instances. In the four SemEval challenge datasets (Pontiki et al., 2014, 2015, 2016), on which Fan et al. (2019) built the four popular TOWE datasets that include only Type I instances, the percentages of Type II instances range from 9.05% to 32.33%. This indicates that there is a considerable amount of Type II instances and that they should not be ignored.
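To make the distinction concrete, a TOWE instance can be viewed as a sentence, an aspect-term span, and a gold tag sequence over the sentence's words, where opinion words receive B/I tags and all other words receive O (the tag scheme used by the models in Section 3.2). Below is a minimal sketch in Python of the two instance types for the sentence in Figure 1; the record format is our own illustration, not the schema of any released dataset.

```python
# A minimal sketch of the two TOWE instance types as sequence-labeling
# examples. The dict format is illustrative, not any dataset's actual schema.

# Type I instance: the aspect term "rose roll" has the opinion word "Try".
type_i = {
    "sentence": ["Try", "the", "rose", "roll", "(", "not", "on", "menu", ")", "."],
    "aspect_term": (2, 4),  # token span of "rose roll"
    "opinion_tags": ["B", "O", "O", "O", "O", "O", "O", "O", "O", "O"],
}

# Type II instance: the aspect term "menu" has no corresponding opinion
# words, so the gold tag sequence is all O.
type_ii = {
    "sentence": ["Try", "the", "rose", "roll", "(", "not", "on", "menu", ")", "."],
    "aspect_term": (7, 8),  # token span of "menu"
    "opinion_tags": ["O"] * 10,
}
```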

Furthermore, as illustrated in Figure 2, ignoring Type II instances leads to a sample selection bias problem (Zadrozny, 2004; Ma et al., 2018). Specifically, TOWE models are trained only on Type I instances, while these models will be used to make inferences on the entire space, which contains both Type I and Type II instances. Thus, the generalization performance of the trained models is hurt. Moreover, the performance of these models on Type I instances cannot reflect their performance on the entire space, i.e., in real-world scenarios.

To validate the sample selection bias problem, we extend four popular TOWE datasets containing only Type I instances to additionally include Type II instances. Experimental results on these datasets show that training TOWE models on the entire space significantly improves model performance and that evaluating TOWE models only on Type I instances overestimates model performance.

2. Related Work

Target-oriented opinion words extraction (TOWE), which extracts the corresponding opinion words from a sentence for a given aspect term, was proposed by Fan et al. (2019). Fan et al. (2019) also built four TOWE datasets (i.e., Rest14, Lapt14, Rest15, and Rest16) based on four SemEval challenge datasets: three restaurant datasets (Rest14, Rest15, and Rest16) from SemEval-2014 Task 4 (Pontiki et al., 2014), SemEval-2015 Task 12 (Pontiki et al., 2015), and SemEval-2016 Task 5 (Pontiki et al., 2016), and a laptop dataset (Lapt14) from SemEval-2014 Task 4. In the original SemEval challenge datasets, the aspect terms are annotated, but the opinion words and their correspondence with aspect terms are not provided; thus, Fan et al. (2019) annotated the corresponding opinion words for the annotated aspect terms. Note that, in the four TOWE datasets built by Fan et al. (2019), only the sentences that contain pairs of aspect terms and opinion words are kept, and only the aspect terms associated with at least one opinion word are used as instances. Fan et al. (2019) also proposed an Inward-Outward LSTM with Global context (IOG) for TOWE.

Later, several other models were proposed for TOWE. Pouran Ben Veyseh et al. (2020) proposed ONG, which combines an Ordered-Neuron LSTM with a graph convolutional network (GCN), and Jiang et al. (2021) proposed an attention-based relational graph convolutional network (ARGCN); both exploit syntactic information over dependency graphs to improve model performance. Feng et al. (2021) proposed target-specified sequence labeling with multi-head self-attention for TOWE. Wu et al. (2020) transferred latent opinion knowledge from resource-rich review sentiment classification datasets to improve the TOWE task. Kang et al. (2021) concentrated on incorporating aspect term information into BERT. Mensah et al. (2021) conducted an empirical study to examine the actual contribution of position embeddings. These models obtained better performance. However, all of these studies, following Fan et al. (2019), used only Type I instances to train and evaluate their models.

Table 1. Statistics of our new TOWE datasets. Ratio is the proportion of Type II instances among all instances in each split.

| Dataset  | Split      | #Type I aspect terms | #Type II aspect terms | Ratio (%) |
|----------|------------|----------------------|-----------------------|-----------|
| Rest14-e | training   | 2138                 | 846                   | 28.35     |
|          | validation | 500                  | 210                   | 29.58     |
|          | test       | 865                  | 269                   | 23.72     |
| Lapt14-e | training   | 1304                 | 623                   | 32.33     |
|          | validation | 305                  | 132                   | 30.21     |
|          | test       | 480                  | 175                   | 26.72     |
| Rest15-e | training   | 864                  | 86                    | 9.05      |
|          | validation | 212                  | 37                    | 14.86     |
|          | test       | 436                  | 106                   | 19.56     |
| Rest16-e | training   | 1218                 | 181                   | 12.94     |
|          | validation | 289                  | 55                    | 15.99     |
|          | test       | 456                  | 156                   | 25.49     |
Table 2. Performance of the models evaluated on the entire space and on Type I instances. All models are trained on Type I instances. Gains indicate how much higher (in relative %) the performance evaluated on Type I instances is than the performance evaluated on the entire space. The best F1 score on the entire space is marked in bold; the best F1 score on Type I instances is marked with an asterisk.

| Method    | Test instance type | Rest14-e P / R / F1       | Lapt14-e P / R / F1      | Rest15-e P / R / F1      | Rest16-e P / R / F1       |
|-----------|--------------------|---------------------------|--------------------------|--------------------------|---------------------------|
| ARGCN     | Entire space       | 71.70 / 81.30 / 76.17     | 60.99 / 69.48 / 64.90    | 68.10 / 75.90 / 71.72    | 69.85 / 81.91 / 75.37     |
|           | Type I instances   | 85.25 / 82.29 / 83.73     | 74.00 / 70.41 / 72.14    | 76.56 / 75.66 / 76.05    | 86.00 / 82.33 / 84.10     |
|           | (Gains (%))        | 18.89 / 1.22 / 9.93       | 21.32 / 1.33 / 11.15     | 12.43 / -0.32 / 6.05     | 23.11 / 0.51 / 11.58      |
| ARGCNbert | Entire space       | 73.19 / 83.46 / 77.98     | 63.19 / 73.22 / 67.80    | 73.88 / 74.73 / 74.27    | 74.26 / 83.78 / **78.72** |
|           | Type I instances   | 86.00 / 83.01 / 84.46     | 75.79 / 76.02 / 75.88    | 78.21 / 75.74 / 76.92    | 86.91 / 84.05 / 85.45     |
|           | (Gains (%))        | 17.50 / -0.54 / 8.31      | 19.95 / 3.83 / 11.92     | 5.87 / 1.35 / 3.58       | 17.03 / 0.32 / 8.54       |
| IOG       | Entire space       | 73.02 / 76.62 / 74.77     | 61.60 / 67.25 / 64.19    | 70.78 / 69.98 / 70.32    | 68.37 / 81.41 / 74.29     |
|           | Type I instances   | 82.80 / 78.66 / 80.64     | 72.43 / 69.63 / 70.96    | 77.19 / 69.86 / 73.29    | 85.67 / 79.81 / 82.58     |
|           | (Gains (%))        | 13.39 / 2.66 / 7.85       | 17.57 / 3.54 / 10.55     | 9.05 / -0.17 / 4.23      | 25.30 / -1.97 / 11.15     |
| IOGbert   | Entire space       | 73.38 / 85.92 / **79.14** | 60.41 / 80.32 / **68.92**| 71.03 / 81.70 / **75.96**| 69.69 / 90.31 / 78.62     |
|           | Type I instances   | 86.50 / 85.81 / 86.13*    | 77.62 / 80.92 / 79.22*   | 78.88 / 81.70 / 80.24*   | 88.68 / 89.66 / 89.16*    |
|           | (Gains (%))        | 17.88 / -0.14 / 8.84      | 28.50 / 0.75 / 14.95     | 11.05 / 0.00 / 5.64      | 27.25 / -0.72 / 13.40     |
Table 3. Performance of the models trained on Type I instances and on the entire space. All models are evaluated on the entire space. Gains indicate how much better (in relative %) the model trained on the entire space is than the model trained on Type I instances. The best F1 score is marked in bold.

| Method    | Training-validation instance type | Rest14-e P / R / F1       | Lapt14-e P / R / F1      | Rest15-e P / R / F1      | Rest16-e P / R / F1       |
|-----------|-----------------------------------|---------------------------|--------------------------|--------------------------|---------------------------|
| ARGCN     | Type I instances                  | 71.70 / 81.30 / 76.17     | 60.99 / 69.48 / 64.90    | 68.10 / 75.90 / 71.72    | 69.85 / 81.91 / 75.37     |
|           | Entire space                      | 81.02 / 78.29 / 79.58     | 73.29 / 64.94 / 68.82    | 75.12 / 73.14 / 74.11    | 75.61 / 81.91 / 78.61     |
|           | (Gains (%))                       | 13.00 / -3.70 / 4.48      | 20.16 / -6.54 / 6.04     | 10.31 / -3.63 / 3.33     | 8.23 / 0.00 / 4.29        |
| ARGCNbert | Type I instances                  | 73.19 / 83.46 / 77.98     | 63.19 / 73.22 / 67.80    | 73.88 / 74.73 / 74.27    | 74.26 / 83.78 / 78.72     |
|           | Entire space                      | 81.37 / 78.39 / 79.84     | 72.46 / 66.43 / 69.27    | 76.63 / 72.50 / 74.48    | 78.68 / 81.03 / 79.82     |
|           | (Gains (%))                       | 11.17 / -6.08 / 2.39      | 14.68 / -9.27 / 2.17     | 3.73 / -2.98 / 0.29      | 5.96 / -3.28 / 1.39       |
| IOG       | Type I instances                  | 73.02 / 76.62 / 74.77     | 61.60 / 67.25 / 64.19    | 70.78 / 69.98 / 70.32    | 68.37 / 81.41 / 74.29     |
|           | Entire space                      | 75.78 / 74.68 / 75.19     | 71.08 / 62.91 / 66.71    | 75.81 / 67.38 / 71.29    | 75.66 / 77.75 / 76.68     |
|           | (Gains (%))                       | 3.78 / -2.53 / 0.56       | 15.39 / -6.44 / 3.93     | 7.10 / -3.71 / 1.38      | 10.66 / -4.50 / 3.21      |
| IOGbert   | Type I instances                  | 73.38 / 85.92 / 79.14     | 60.41 / 80.32 / 68.92    | 71.03 / 81.70 / 75.96    | 69.69 / 90.31 / 78.62     |
|           | Entire space                      | 81.64 / 80.87 / **81.24** | 71.64 / 75.24 / **73.35**| 77.82 / 76.35 / **76.99**| 75.19 / 87.75 / **80.97** |
|           | (Gains (%))                       | 11.26 / -5.88 / 2.66      | 18.60 / -6.32 / 6.43     | 9.56 / -6.55 / 1.37      | 7.89 / -2.83 / 3.00       |

3. Experimental Setup

3.1. Datasets and Metrics

To validate the sample selection bias problem in TOWE modeling, we built four new TOWE datasets (Rest14-e, Lapt14-e, Rest15-e, and Rest16-e) containing both Type I and Type II instances, based on the four popular TOWE datasets (Rest14, Lapt14, Rest15, and Rest16) (Fan et al., 2019) that include only Type I instances. In our dataset names, the letter e stands for entire space. Fan et al. (2019) built the four TOWE datasets by annotating the corresponding opinion words for the annotated aspect terms in four SemEval challenge datasets (Pontiki et al., 2014, 2015, 2016); however, only the aspect terms associated with at least one opinion word were kept and used as TOWE instances. To build our new datasets, we first include all instances from the TOWE datasets built by Fan et al. (2019). We then add the aspect terms in the original SemEval challenge datasets that do not have corresponding opinion words and hence were excluded by Fan et al. (2019). The statistics of our new TOWE datasets are shown in Table 1.
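This construction amounts to a simple merge: keep every Type I instance from the original TOWE datasets, and add every annotated aspect term that has no corresponding opinion words as a Type II instance whose gold tag sequence is all O. The following is a hedged sketch of the procedure; the function, field names, and record format are our own illustration, not the released data format.

```python
def build_entire_space_dataset(towe_instances, semeval_aspect_terms):
    """Merge the Type I instances of Fan et al. (2019) with the aspect terms
    that were excluded because they have no opinion words (Type II instances).

    Illustrative reconstruction of the procedure described in Section 3.1;
    identifiers and record formats are assumptions, not the actual schema.
    """
    # Aspect terms already covered by Type I instances.
    covered = {(inst["sentence_id"], inst["aspect_span"]) for inst in towe_instances}
    dataset = list(towe_instances)  # all Type I instances are kept
    for term in semeval_aspect_terms:
        key = (term["sentence_id"], term["aspect_span"])
        if key not in covered:  # aspect term without corresponding opinion words
            dataset.append({
                "sentence_id": term["sentence_id"],
                "tokens": term["tokens"],
                "aspect_span": term["aspect_span"],
                # Type II instance: no opinion words, so an all-O tag sequence.
                "opinion_tags": ["O"] * len(term["tokens"]),
            })
    return dataset
```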

Following previous works (Fan et al., 2019), we adopt precision (P), recall (R), and F1-score (F1) as evaluation metrics. An extraction is considered correct only when the predicted opinion words, from the beginning to the end of the span, exactly match the ground truth.
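Under this exact-match criterion, a predicted opinion span counts as correct only if it coincides with a gold span. The snippet below is a minimal sketch of how such span-level P/R/F1 can be computed; it is our own helper for illustration, not the authors' evaluation script.

```python
def exact_match_prf1(pred_spans, gold_spans):
    """Span-level P/R/F1 where a prediction is correct only if its
    (instance_id, start, end) triple matches a gold span exactly.

    pred_spans, gold_spans: sets of (instance_id, start, end) tuples.
    Illustrative re-implementation, not the official evaluation script.
    """
    correct = len(pred_spans & gold_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Example: the gold opinion span "Try" for "rose roll" is found, but a
# spurious span is also predicted for the Type II instance "menu".
pred = {("s1-roseroll", 0, 1), ("s1-menu", 0, 1)}
gold = {("s1-roseroll", 0, 1)}
print(exact_match_prf1(pred, gold))  # (0.5, 1.0, 0.666...)
```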

Table 4. Performance of IOGbert trained on Type I instances and on the entire space, evaluated on Type I instances. The best F1 score is marked in bold.

| Training-validation instance type | Rest14-e P / R / F1       | Lapt14-e P / R / F1      | Rest15-e P / R / F1       | Rest16-e P / R / F1       |
|-----------------------------------|---------------------------|--------------------------|---------------------------|---------------------------|
| Type I instances                  | 86.50 / 85.81 / **86.13** | 77.62 / 80.92 / **79.22**| 78.88 / 81.70 / **80.24** | 88.68 / 89.66 / **89.16** |
| Entire space                      | 88.37 / 80.79 / 84.40     | 80.23 / 75.73 / 77.90    | 82.295 / 76.06 / 79.01    | 89.84 / 88.32 / 89.06     |
Table 5. Case study. Incorrect predictions are marked with an asterisk.

| Id | Sentence                                                                           | Aspect term  | Training-validation instance type | Prediction | Ground truth |
|----|------------------------------------------------------------------------------------|--------------|-----------------------------------|------------|--------------|
| 1  | Even when the chef is not in the house, the food and service are right on target.  | chef         | Type I instances                  | ["not"]*   | []           |
|    |                                                                                    | chef         | Entire space                      | []         | []           |
| 2  | I never had an orange donut before so I gave it a shot.                            | orange donut | Type I instances                  | ["never"]* | []           |
|    |                                                                                    | orange donut | Entire space                      | []         | []           |
| 3  | Entrees include classics like lasagna, fettuccine Alfredo and chicken parmigiana.  | Entrees      | Type I instances                  | ["classics"]* | []        |
|    |                                                                                    | Entrees      | Entire space                      | []         | []           |
| 4  | Try the rose roll (not on menu).                                                   | rose roll    | Type I instances                  | ["Try"]    | ["Try"]      |
|    |                                                                                    | menu         | Type I instances                  | ["Try"]*   | []           |
|    |                                                                                    | rose roll    | Entire space                      | []*        | ["Try"]      |
|    |                                                                                    | menu         | Entire space                      | []         | []           |

3.2. TOWE Models

We run experiments based on four TOWE models:

  • ARGCN (Jiang et al., 2021) first incorporates aspect term information by combining word representations with category embeddings that depend on each word's target tag. An attention-based relational graph convolutional network then learns semantic and syntactic relevance between words simultaneously, and a BiLSTM captures sequential information. The resulting word representations are used to predict the word tags {O, B, I}.

  • ARGCNbert (Jiang et al., 2021) is the BERT (Devlin et al., 2018) version of ARGCN. The last hidden states of pre-trained BERT are adopted as word representations, and BERT is fine-tuned jointly.

  • IOG (Fan et al., 2019) uses an Inward-Outward LSTM to pass aspect term information to the left and right contexts of the aspect term, obtaining aspect-term-specific word representations. A BiLSTM then takes these representations as input and outputs global contextualized word representations. Finally, the combination of the aspect-term-specific and global contextualized word representations is used to predict the word tags {O, B, I}.

  • IOGbert is the BERT version of IOG. Specifically, the word embedding layer and the Inward-Outward LSTM in IOG are replaced with BERT, which takes “[CLS] sentence [SEP] aspect term [SEP]” as input (a sketch of this input construction is shown below).
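As a concrete illustration of IOGbert's input packing, the following hedged sketch uses the Hugging Face transformers tokenizer to build the “[CLS] sentence [SEP] aspect term [SEP]” sequence; it shows only the input format described above and is not the authors' implementation.

```python
# Hedged sketch of IOGbert's input construction, assuming the Hugging Face
# transformers library and the bert-base-uncased checkpoint.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentence = "Try the rose roll (not on menu)."
aspect_term = "rose roll"

# Passing the aspect term as the second segment yields the
# [CLS] sentence [SEP] aspect term [SEP] packing.
encoding = tokenizer(sentence, aspect_term, return_tensors="pt")
print(tokenizer.decode(encoding["input_ids"][0]))
# Prints something like:
# [CLS] try the rose roll ( not on menu ). [SEP] rose roll [SEP]
```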

We run all models five times and report the average results on the test sets.

4. Results

4.1. Evaluation on Entire Space

To observe the difference in TOWE model performance on Type I instances versus the entire space, the four models are trained on Type I instances (both the training and validation sets include only Type I instances), as in previous studies (Fan et al., 2019), and then evaluated on both Type I instances and the entire space. Experimental results are shown in Table 2. All models across all four datasets obtain much better F1 scores on Type I instances than on the entire space. For example, the gains of IOGbert on Rest14-e, Lapt14-e, Rest15-e, and Rest16-e are 8.84%, 14.95%, 5.64%, and 13.40%, respectively. The gain on Rest15-e is the smallest because the ratio of Type II instances in Rest15-e is the smallest (Table 1). In any case, evaluating TOWE models only on Type I instances overestimates model performance.
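For clarity, the Gains rows in Tables 2 and 3 are consistent with relative differences between the two scores; for instance, ARGCN's Rest14-e F1 gain of 9.93% equals (83.73 - 76.17) / 76.17. A one-line helper, ours for illustration:

```python
def relative_gain(score_a, score_b):
    """Relative gain (%) of score_a over score_b, matching our reading of
    the Gains rows in Tables 2 and 3 (an assumption, for illustration)."""
    return (score_a - score_b) / score_b * 100

# ARGCN on Rest14-e: F1 on Type I instances vs. F1 on the entire space.
print(round(relative_gain(83.73, 76.17), 2))  # 9.93
```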

4.2. Training on Entire Space

In this section, the four models are trained in two settings: (1) both the training and validation sets include only Type I instances, and (2) both include Type I and Type II instances (i.e., the entire space). The trained models are then evaluated on the entire space. Experimental results are shown in Table 3, from which we draw two conclusions. First, models trained on the entire space outperform those trained on Type I instances in terms of F1 score across all four datasets, indicating that training on the entire space improves the generalization performance of trained models. Second, while models trained on the entire space obtain better precision, models trained on Type I instances obtain better recall. The reason is that the additional instances in the entire space, i.e., Type II instances, contain only aspect terms without corresponding opinion words; they help TOWE models exclude incorrect opinion words for aspect terms, but also lead the models to exclude some correct opinion words.

IOGbert trained on Type I instances and on the entire space is also evaluated on Type I instances. The results are shown in Table 4: IOGbert trained on Type I instances obtains better performance than IOGbert trained on the entire space. This indicates that it is still necessary to design models that work well on the entire space, which we leave for future research.

4.3. Case Study

To further understand the impact of Type II instances on TOWE models, we show the predictions of IOGbert (the best TOWE model in our experiments) trained on Type I instances and on the entire space for four sentences from the test set of the Rest14-e dataset. Each of the first three sentences contains only one aspect term, and all three aspect terms are Type II instances. While IOGbert trained on the entire space makes correct inferences on these three instances, IOGbert trained on Type I instances erroneously extracts words for the three aspect terms; in fact, the extracted words are not opinion words at all. The reason is that the entire-space training set of Rest14-e additionally includes Type II instances, some of which are similar to the instances in the first three sentences. For example, the sentence “food was delivered by a busboy, not waiter” together with the aspect term “waiter” is a Type II instance in the training set of Rest14-e; this instance may teach IOGbert to extract nothing for the aspect term “chef” in the first sentence. Thus, it is essential to train TOWE models on the entire space. From the predictions on the fourth sentence, we can see that IOGbert trained on Type I instances tends to extract more opinion words, while IOGbert trained on the entire space tends to extract fewer. This case shows why TOWE models trained on the entire space obtain lower recall than those trained on Type I instances.

5. Conclusion

In this paper, we explore the sample selection bias problem in target-oriented opinion words extraction (TOWE) modeling. Specifically, we divide TOWE instances into two types: Type I instances, where the aspect terms are associated with at least one opinion word, and Type II instances, where the aspect terms do not have opinion words. Previous studies used only Type I instances to train and evaluate their models, resulting in a sample selection bias problem: training TOWE models only on Type I instances may hurt their generalization performance, and evaluating them only on Type I instances cannot reflect their performance in real-world scenarios. To validate these hypotheses, we added Type II instances to previous TOWE datasets that include only Type I instances. Experimental results on these datasets demonstrate that training TOWE models on the entire space, which includes both Type I and Type II instances, significantly improves model performance, and that evaluating TOWE models only on Type I instances overestimates model performance.

Since the datasets used by aspect sentiment triplet extraction (ASTE) (Peng et al., 2020) are constructed based on the TOWE datasets built by Fan et al. (2019), they may also suffer from the sample selection bias problem. In the future, we will explore the sample selection bias problem in ASTE.

References

  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Fan et al. (2019) Zhifang Fan, Zhen Wu, Xinyu Dai, Shujian Huang, and Jiajun Chen. 2019. Target-oriented opinion words extraction with target-fused neural sequence labeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2509–2518.
  • Feng et al. (2021) Yuhao Feng, Yanghui Rao, Yuyao Tang, Ninghua Wang, and He Liu. 2021. Target-specified Sequence Labeling with Multi-head Self-attention for Target-oriented Opinion Words Extraction. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1805–1815.
  • Hu and Liu (2004) Minqing Hu and Bing Liu. 2004. Mining and Summarizing Customer Reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Seattle, WA, USA) (KDD ’04). Association for Computing Machinery, New York, NY, USA, 168–177. https://doi.org/10.1145/1014052.1014073
  • Jiang et al. (2021) Junfeng Jiang, An Wang, and Akiko Aizawa. 2021. Attention-based Relational Graph Convolutional Network for Target-Oriented Opinion Words Extraction. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Association for Computational Linguistics, Online, 1986–1997. https://www.aclweb.org/anthology/2021.eacl-main.170
  • Kang et al. (2021) Taegwan Kang, Minwoo Lee, Nakyeong Yang, and Kyomin Jung. 2021. RABERT: Relation-Aware BERT for Target-Oriented Opinion Words Extraction. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 3127–3131.
  • Liu (2012) Bing Liu. 2012. Sentiment analysis and opinion mining. Synthesis lectures on human language technologies 5, 1 (2012), 1–167.
  • Ma et al. (2018) Xiao Ma, Liqin Zhao, Guan Huang, Zhi Wang, Zelin Hu, Xiaoqiang Zhu, and Kun Gai. 2018. Entire space multi-task model: An effective approach for estimating post-click conversion rate. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. 1137–1140.
  • Mensah et al. (2021) Samuel Mensah, Kai Sun, and Nikolaos Aletras. 2021. An Empirical Study on Leveraging Position Embeddings for Target-oriented Opinion Words Extraction. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 9174–9179. https://doi.org/10.18653/v1/2021.emnlp-main.722
  • Nasukawa and Yi (2003) Tetsuya Nasukawa and Jeonghee Yi. 2003. Sentiment Analysis: Capturing Favorability Using Natural Language Processing. In Proceedings of the 2nd International Conference on Knowledge Capture (Sanibel Island, FL, USA) (K-CAP ’03). Association for Computing Machinery, New York, NY, USA, 70–77. https://doi.org/10.1145/945645.945658
  • Peng et al. (2020) Haiyun Peng, Lu Xu, Lidong Bing, Fei Huang, Wei Lu, and Luo Si. 2020. Knowing what, how and why: A near complete solution for aspect-based sentiment analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 8600–8607.
  • Pontiki et al. (2016) Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, Mohammad AL-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clercq, Véronique Hoste, Marianna Apidianaki, Xavier Tannier, Natalia Loukachevitch, Evgeniy Kotelnikov, Nuria Bel, Salud María Jiménez-Zafra, and Gülşen Eryiğit. 2016. SemEval-2016 Task 5: Aspect Based Sentiment Analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). Association for Computational Linguistics, San Diego, California, 19–30. https://doi.org/10.18653/v1/S16-1002
  • Pontiki et al. (2015) Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. 2015. SemEval-2015 Task 12: Aspect Based Sentiment Analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015). Association for Computational Linguistics, Denver, Colorado, 486–495. https://doi.org/10.18653/v1/S15-2082
  • Pontiki et al. (2014) Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 Task 4: Aspect Based Sentiment Analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014). Association for Computational Linguistics, Dublin, Ireland, 27–35. https://doi.org/10.3115/v1/S14-2004
  • Pouran Ben Veyseh et al. (2020) Amir Pouran Ben Veyseh, Nasim Nouri, Franck Dernoncourt, Dejing Dou, and Thien Huu Nguyen. 2020. Introducing Syntactic Structures into Target Opinion Word Extraction with Deep Learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, 8947–8956. https://doi.org/10.18653/v1/2020.emnlp-main.719
  • Wu et al. (2020) Zhen Wu, Fei Zhao, Xin-Yu Dai, Shujian Huang, and Jiajun Chen. 2020. Latent Opinions Transfer Network for Target-Oriented Opinion Words Extraction. arXiv preprint arXiv:2001.01989 (2020).
  • Zadrozny (2004) Bianca Zadrozny. 2004. Learning and evaluating classifiers under sample selection bias. In Proceedings of the twenty-first international conference on Machine learning. 114.