Yongdong Xu (these authors contributed equally to this work)
School of Computer Science and Technology, Harbin Institute of Technology at Weihai, 2# Wenhuaxi Road, Weihai City, 264209, Shandong Province, China
Adaptive Prompt Learning-based Few-Shot Sentiment Analysis
Abstract
In the field of natural language processing, sentiment analysis via deep learning achieves excellent performance when large labeled datasets are available. However, labeled data are insufficient for many sentiment analysis tasks, and obtaining such data is time-consuming and laborious. Prompt learning addresses this data deficiency by reformulating downstream tasks with the help of prompts; as a result, an appropriate prompt is very important for model performance. This paper proposes an adaptive prompt (AP) construction strategy that uses a seq2seq-attention structure to acquire the semantic information of the input sequence and then dynamically constructs an adaptive prompt. This not only improves the quality of the prompt, but also generalizes effectively to other fields through a pre-trained prompt constructed from existing public labeled data. Experimental results on the FewCLUE datasets demonstrate that the proposed AP method effectively constructs appropriate adaptive prompts regardless of the quality of the hand-crafted prompt and outperforms state-of-the-art baselines. Our implementation is publicly available at https://github.com/simonZPF/AP.
keywords:
Natural language processing, Sentiment analysis, Adaptive prompt learning, Seq2seq-attention

1 Introduction
Nowadays, deep learning (DL) is widely used in image, speech, text, and other fields to solve a variety of problems with excellent results. At the same time, model effectiveness depends on large-scale, high-quality labeled data, which is often insufficient. Manually labeling large-scale data is time-consuming and laborious, so it is difficult to obtain enough labeled data to train a model. To address this data acquisition issue, one option is to learn from large-scale unsupervised data together with a small amount of supervised data, as in semi-supervised learning. Another is to learn general features from large-scale data and then adapt them to specific tasks, as in fine-tuning pre-trained models and prompt learning. In this work, an adaptive prompt method (AP) that introduces a seq2seq-attention structure is proposed to achieve state-of-the-art performance on low-resource tasks. In addition, the ability to construct prompts can be further improved by pre-training on existing labeled datasets from other fields.
2 Related work
2.1 Sentiment analysis
Sentiment analysis originates from the analysis of subjectivity in sentences [1]. Due to the emergence of a large number of online resources, sentiment analysis has become an active research field since 2000 [2]. Early sentiment analysis mainly focused on building sentiment dictionaries for text classification. These dictionaries were constructed manually by collecting words that carry sentiment tendencies and labeling the sentiment polarity and intensity of those words to varying degrees; a high-quality sentiment dictionary is therefore required [3]. Because language is flexible and non-standard, it is difficult to construct general and efficient rules applicable to all contexts. Machine learning-based sentiment analysis mainly relies on NLP researchers or engineers using their domain knowledge to define and extract salient features from the raw data, such as n-gram features, and then applies traditional machine learning classifiers such as support vector machines, naive Bayes, and maximum entropy for supervised learning [4]. Li, G. et al. [5] built a model with prior knowledge of the categorization information to extract meaningful features from unstructured texts using TF-IDF (term frequency-inverse document frequency).
In recent years, with the development of deep learning theory, neural networks have gradually matured in the field of sentiment analysis. Deep neural networks can effectively capture high-level semantic information of text without complex feature engineering, and their expressive power is many times greater than that of shallow models. Among them, convolutional neural networks and recurrent neural networks are the most widely used [6]. Li, D. et al. [7] proposed the BLSTM and CNN Stacking Architecture (BCSA) to enhance the ability to recognize emotions. Chen et al. [8] proposed HUSN, a sentiment classification algorithm that utilizes users' review habits to enhance hierarchical neural networks. Sadr, H. et al. [9] proposed a model that employs a recursive neural network, thanks to its tree structure, as a substitute for the pooling layer in a convolutional network, with the aim of capturing long-term dependencies and reducing the loss of local information.
2.2 Pre-trained model
The purpose of pre-trained language models (PLMs) is to train on the large amount of text that appears in everyday life so that the model learns the probability distribution of words in these texts, thereby modeling the distributions from which such texts are drawn.
Traditional PLM techniques aim to learn word embeddings. Because downstream tasks use only the resulting embeddings rather than the models themselves, these models are usually shallow and computationally cheap, e.g., skip-gram [10] and GloVe [11]. Although such pre-trained word vectors can capture the semantic meaning of words, they are context-independent and cannot capture higher-level properties of text such as syntax and semantics. ELMo [12] proposed a context-sensitive text representation method that constructs representations with a deep bidirectional language model, effectively addressing polysemy. In 2018, Devlin et al. [13] proposed the BERT (Bidirectional Encoder Representations from Transformers) pre-trained language model. The model trains on massive corpora with a bidirectional Transformer encoder and uses a masked language model (MLM) objective to produce deep bidirectional language representations. After pre-training, only an additional output layer needs to be added and fine-tuned to achieve state-of-the-art performance on a variety of downstream tasks, without task-specific structural modifications to BERT.
BERT opened a new era, and a large number of pre-trained language models have emerged since then. For example, RoBERTa [14] retains the original BERT architecture but trains longer, with larger batches, longer sequences, and more data; it also removes next-sentence prediction and uses dynamic masking. ALBERT [15] addresses BERT's high memory consumption and slow training speed. ERNIE [16] introduced a knowledge masking strategy, including entity-level and phrase-level masking, to replace the random masking in BERT. BERT has also been adapted to other languages: Farahani et al. [17] proposed a monolingual BERT for Persian that is lighter than the original multilingual BERT, and BERT-wwm [18] for Chinese masks not only entity words and phrases contiguously but also all characters that form a Chinese word.
2.3 Prompt Method
With the increasing size of pre-trained language models, the hardware requirements, data requirements, and actual cost of fine-tuning keep rising. In addition, the rich diversity of downstream tasks makes the design of the pre-training and fine-tuning stages cumbersome and complex. Researchers therefore hope to explore smaller, lighter, more universal, and more efficient methods. The prompt method is an attempt in this direction and includes hand-crafted and automated prompt methods. A prompt method adds additional text to the input segment in order to better exploit the knowledge of the pre-trained language model. Schick, T. et al. [19] designed Pattern-Exploiting Training (PET), a semi-supervised training procedure that redefines input examples as cloze-style phrases to help the language model understand the given task. Jiang et al. [20] proposed a mining-based method that automatically finds templates for a given set of training inputs and outputs: it searches a large text corpus containing the input and output strings for the intermediate words or dependency paths between them, and uses the most frequent ones as templates.
Davison et al. [21] designed an input template (head, relation, tail) using a language model in their study of knowledge-base-related tasks. Liu, X. et al. [22] proposed P-tuning, which abandons the conventional requirement that "the template is composed of natural language", uses tokens never seen by the model to form the template, transforms template construction into a continuous parameter optimization problem, and thereby realizes automatic template construction.
In the hand-crafted prompt method, the accuracy of the model depends heavily on the quality of the constructed template, and different templates may yield very different results. For some tasks, it is not easy to discover an optimal prompt manually. In the automated prompt method, the prompt is constructed by the model without manual work. However, neither method can use the semantic information of the input text during prompt construction.
To solve the above problems, this paper proposes a template construction method that introduces a seq2seq-attention structure, which dynamically generates matching template vectors and makes full use of the original text information. At the same time, following the idea of pre-trained models, we design a pre-training-based template construction strategy that makes full use of public sentiment analysis datasets from high-resource fields and applies them to other low-resource fields.
The main contributions of this paper include three aspects:
1. We propose an adaptive prompt method that introduces a seq2seq-attention structure. It combines the advantages of hand-crafted and automated prompts and makes full use of the semantic information of the input text.
2. Experimental results on the FewCLUE datasets show that the proposed method is effective on the few-shot sentiment analysis task.
3. We propose to pre-train the adaptive prompt module on high-resource tasks and then transfer or fine-tune it on low-resource tasks, which yields significant improvements in low-resource settings.
3 Methodology
In this section, we propose an adaptive prompt method based on seq2seq-attention (AP) and describe its implementation. We introduce a seq2seq-attention structure to generate an adaptive template from the input, and then use the pre-trained model to perform sentiment analysis.
3.1 Adaptive prompt learning
Our work builds on an adaptive prompt learning model that improves the traditional hand-crafted prompt learning (HPL) method. The HPL model consists of an input layer, a hand-crafted prompt, a pre-trained language model, and an output layer.
Given a pre-trained model M, a vocabulary V, an input sequence X of length n: {x1, x2, ..., xn}, a verbalizer W: {w1, w2, ..., wl}, and a hand-crafted prompt P: {p1, p2, ..., pi, [MASK], pi+1, ..., pm}, where the value of [MASK] comes from W. First, the input sequence X and the hand-crafted prompt P form a template t. Then t is mapped by the embedding layer e of the pre-trained model (each token p1 is mapped to e(p1)) into e(t): {e(p1), e(p2), ..., e(pi), e([MASK]), e(pi+1), ..., e(pm), e(x1), e(x2), ..., e(xn)}. After that, the pre-trained model M computes the probability of each word in W at the [MASK] position and selects the word with the maximum probability. For example, for the sentiment prediction of "The weather is very good" with the hand-crafted prompt "It is [MASK]." and the verbalizer {"good", "bad"}, the traditional prompt model constructs the template "It is [MASK]. The weather is very good." and M returns the predicted value, as shown in Figure 1(a).
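To make the HPL pipeline concrete, the sketch below scores the verbalizer words at the [MASK] position with an off-the-shelf masked language model. The model name, prompt text, and verbalizer words are illustrative assumptions rather than the exact configuration used in this paper.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative choices: an English BERT, an English prompt, and a two-word verbalizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def predict_sentiment(text, prompt="It is [MASK]. ", verbalizer=("good", "bad")):
    # Template t = hand-crafted prompt P (containing [MASK]) + input sequence X.
    template = prompt.replace("[MASK]", tokenizer.mask_token) + text
    inputs = tokenizer(template, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]   # scores over the vocabulary V
    # Restrict the prediction to the verbalizer words W and pick the most probable one.
    label_ids = [tokenizer.convert_tokens_to_ids(w) for w in verbalizer]
    return verbalizer[int(torch.argmax(logits[label_ids]))]

print(predict_sentiment("The weather is very good"))   # expected output: "good"
```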


While the strategy of traditional hand-crafted prompt learning is intuitive and works well on some sentiment tasks, the approach has two issues: 1) it is hard for humans to discover optimal prompts for all tasks; 2) even within the same task, the optimal prompt differs across input sequences, and it is hard for an algorithm to dynamically find the "best" prompt for a specific input sequence. We therefore consider automatic adaptive prompt design to overcome the defects of manual prompt design. As shown in Figure 1(b), we use an adaptive prompt layer to generate the prompt instead of a hand-crafted prompt. To strengthen the relationship between the prompt and the input sequence X, we use the text information of X to automatically generate the corresponding prompt, i.e., an adaptive prompt (in traditional automated methods there is no direct correlation between the prompt and the input X, whereas we train the adaptive prompt using the context information of X). In this work, we use a seq2seq-attention structure as the adaptive prompt layer. The seq2seq structure generates a target sequence (the prompt sequence) from a source sequence (the word embedding vectors of input X), which makes it suitable for generating prompt sequences. Meanwhile, an attention structure is introduced to better capture the semantic details of input X, as shown in Figure 2.

In general, the seq2seq-attention structure, acting as the adaptive prompt layer, generates from input X a vector sequence h of length s that constitutes the adaptive prompt. Each vector has the same dimensionality as the output of the embedding layer of the pre-trained model. Because the adaptive prompt's embedding vectors are continuous, we can find better continuous prompts beyond what the original vocabulary can express [22]. In addition, the adaptive prompt layer captures the text information of input X through the attention structure (the yellow part in Figure 2), which makes the generated prompt fit the input text better.
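One possible realization of the adaptive prompt layer is sketched below: a GRU encoder reads the input embeddings, and a small decoder with attention over the encoder states emits s prompt vectors in the embedding space of the PLM. The choice of recurrent cells, layer sizes, and attention variant are assumptions made for illustration; the paper does not fix these details.

```python
import torch
import torch.nn as nn

class AdaptivePromptLayer(nn.Module):
    """Sketch of a seq2seq-attention module mapping e(x1..xn) to adaptive prompts h1..hs."""

    def __init__(self, embed_dim=768, hidden_dim=768, prompt_len=2):
        super().__init__()
        self.prompt_len = prompt_len
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRUCell(embed_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=1, batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, embed_dim)  # project [state; context] to embed_dim

    def forward(self, x_embed):                   # x_embed: (batch, n, embed_dim)
        enc_out, enc_h = self.encoder(x_embed)    # encode the input sequence X
        state = enc_h[-1]                         # initial decoder state
        step_in = x_embed.new_zeros(x_embed.size(0), x_embed.size(2))
        prompts = []
        for _ in range(self.prompt_len):          # generate s adaptive prompt vectors
            state = self.decoder(step_in, state)
            # attention over the encoder outputs captures the semantics of X
            context, _ = self.attn(state.unsqueeze(1), enc_out, enc_out)
            h_t = self.out(torch.cat([state, context.squeeze(1)], dim=-1))
            prompts.append(h_t)
            step_in = h_t                         # feed the generated vector back as next input
        return torch.stack(prompts, dim=1)        # (batch, s, embed_dim), matched to PLM embeddings
```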
3.2 Hybrid prompt learning
Although automated prompts have various advantages on most tasks, such as a wide application range, strong generalization ability, and stable, balanced performance, the algorithm may fall into a local optimum in many cases. The hand-crafted prompt performs excellently in some cases, but it requires experienced experts and is unstable. To combine the advantages of the two methods, we design a hybrid prompt composed of a hand-crafted part and an automated part, as shown in Figure 3.

The hybrid prompt model combines the word vectors generated from the hand-crafted prompt with those from the adaptive prompt layer. Through the hybrid prompt embedding layer, the template is represented by the triple (P, h, X) as {e(p1), e(p2), ..., e(pi), e([MASK]), e(pi+1), ..., e(pm), h1, h2, ..., hs, e(x1), e(x2), ..., e(xn)}, where P is the hand-crafted prompt, h is the adaptive prompt embedding vector sequence, and X is the input text sequence. The final prediction is then computed by the pre-trained model.
Both the adaptive prompt and the hand-crafted prompt affect the final result y. The results show (see Section 4.4.1 for details) that the model learns to adjust the weights of P and h to produce better outputs. Therefore, the model theoretically has the advantages of both hand-crafted and automated prompts: when a "good" hand-crafted prompt can be found, it further improves the model, and even when it cannot, an excellent prompt can still be generated by the adaptive prompt part.
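As an illustration, the hybrid template can be assembled by concatenating the embedded hand-crafted prompt (containing [MASK]), the adaptive prompt vectors h, and the embedded input, and feeding the result to the PLM through its inputs_embeds interface. The helper below reuses the `model`, `tokenizer`, and `prompt_layer` objects from the earlier sketches; it is a simplified sketch, not the authors' exact implementation.

```python
import torch

def hybrid_forward(model, tokenizer, prompt_layer, prompt_text, text,
                   verbalizer=("good", "bad")):
    embed = model.get_input_embeddings()
    p_ids = tokenizer(prompt_text.replace("[MASK]", tokenizer.mask_token),
                      return_tensors="pt", add_special_tokens=False)["input_ids"]
    x_ids = tokenizer(text, return_tensors="pt", add_special_tokens=False)["input_ids"]
    e_p, e_x = embed(p_ids), embed(x_ids)      # e(p1)...e(pm) incl. e([MASK]), e(x1)...e(xn)
    h = prompt_layer(e_x)                      # adaptive prompt vectors h1...hs from X
    inputs_embeds = torch.cat([e_p, h, e_x], dim=1)   # hybrid template (P, h, X)
    logits = model(inputs_embeds=inputs_embeds).logits
    mask_pos = (p_ids[0] == tokenizer.mask_token_id).nonzero().item()  # [MASK] sits inside e(P)
    label_ids = [tokenizer.convert_tokens_to_ids(w) for w in verbalizer]
    return verbalizer[int(torch.argmax(logits[0, mask_pos, label_ids]))]
```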
4 Experiments and results
4.1 Datasets
We evaluate AP mainly on the FewCLUE EPRSTMT task (E-commerce Product Review Dataset for Sentiment Analysis), which is labeled as Positive or Negative and was collected by the ICIP Lab of Beijing Normal University. The datasets used in the migration experiment include a social media sentiment dataset (more than 100,000 labeled posts from Sina Weibo, with about 50,000 positive and 50,000 negative comments), hotel review data (more than 7,000 reviews, over 5,000 positive and about 2,000 negative), user comments from a takeout platform (4,000 positive and 8,000 negative), and online shopping data covering 7 categories (books, fruit, shampoo, water heaters, milk, clothes, and hotels) with more than 60,000 comments in total, about 30,000 positive and 30,000 negative. The English data consist of about 7,000 movie reviews, roughly 3,500 positive and 3,500 negative.
4.2 Hyper-parameters setting
To fully capture the information in each sentence, we set the maximum sentence length to twice the average sentence length of the dataset. In the experiments, RoBERTa-wwm-ext is adopted as the pre-trained language model. The batch size is set to 5, and the output length of the adaptive prompt layer is set to 2 for Chinese and 4 for English. The model uses the Adam optimizer with different learning rates for different optimization strategies.
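For reference, the training configuration described above can be collected in a small settings object; the field names below are illustrative, and the learning rates are the ones quoted later in Sections 4.4.1, 4.4.3, and 4.4.4.

```python
from dataclasses import dataclass

@dataclass
class APConfig:
    pretrained_model: str = "hfl/chinese-roberta-wwm-ext"  # RoBERTa-wwm-ext
    batch_size: int = 5
    prompt_len_zh: int = 2              # adaptive prompt length for Chinese
    prompt_len_en: int = 4              # adaptive prompt length for English
    optimizer: str = "adam"
    lr_prompt_lm_tuning: float = 1e-5   # Section 4.4.1
    lr_migration: float = 2e-6          # Section 4.4.3
    lr_pretrain_finetune: float = 5e-6  # Section 4.4.4
```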
4.3 Optimization strategy
The model consists of two trainable parts: the pre-trained model parameters and the seq2seq-attention parameters, i.e., the adaptive prompt layer. Accordingly, our optimization methods fall into two categories. The first is to fine-tune all parameters (prompt+LM tuning); in this setting, the prompt-relevant parameters are fine-tuned together with all or some of the parameters of the pre-trained model [23]. The second is to fine-tune only the seq2seq-attention part (fixed-LM prompt tuning); in this scenario, where additional prompt-relevant parameters are introduced besides the parameters of the pre-trained model, only the prompt parameters are updated using the supervision signal obtained from the downstream training samples, while the entire pre-trained LM remains unchanged [23].
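The two settings differ only in which parameter groups receive gradients. The sketch below shows one way to set this up, assuming `model` is the PLM and `prompt_layer` is the seq2seq-attention module from the earlier sketches; the function name and learning-rate arguments are illustrative.

```python
import torch

def build_optimizer(model, prompt_layer, mode="prompt+lm", lr_lm=1e-5, lr_prompt=1e-5):
    if mode == "prompt+lm":
        # Prompt+LM tuning: update both the PLM and the adaptive prompt layer.
        params = [{"params": model.parameters(), "lr": lr_lm},
                  {"params": prompt_layer.parameters(), "lr": lr_prompt}]
    else:
        # Fixed-LM prompt tuning: freeze the PLM, train only the adaptive prompt layer.
        for p in model.parameters():
            p.requires_grad = False
        params = [{"params": prompt_layer.parameters(), "lr": lr_prompt}]
    return torch.optim.Adam(params)
```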
4.4 Results analysis
4.4.1 Prompt+LM Tuning
In this experiment, we fine-tune the full set of model parameters and test on the Chinese FewCLUE dataset (32 training samples and about 600 test samples) and an English movie-review dataset (32 training samples and 600 test samples). In this setting, fine-tuning the pre-trained model plays the leading role in training, and the seq2seq part plays an auxiliary role; that is, the hand-crafted prompt is primary, while the adaptive prompt supplements and strengthens it. The learning rate is 1e-5. The results are shown in Tables 1 and 2.
Table 1: Prompt+LM tuning accuracy on the Chinese FewCLUE (EPRSTMT) dataset.

| Prompt | Zero-Shot | HPL | AP |
|---|---|---|---|
| __开心 (happy) | 75.7% | 83.8% | 84.1% |
| __高兴 (glad) | 70.2% | 79.3% | 83.9% |
| __好 (good) | 64.9% | 80.3% | 82.8% |
| __行 (OK) | 51.8% | 78.0% | 82.3% |
Table 2: Prompt+LM tuning accuracy on the English movie-review dataset.

| Prompt | Zero-Shot | HPL | AP |
|---|---|---|---|
| It was __. | 62.3% | 72.7% | 77.8% |
| Just __! | 60.2% | 75.7% | 78.3% |
| It makes me feel __ that | 72.0% | 74.5% | 80.0% |
Here, the zero-shot method only uses the prompt to construct the template and predicts with the pre-trained model without fine-tuning its parameters; we use the zero-shot results to judge the quality of the hand-crafted prompts. The quality of the hand-crafted prompt has a large impact on the HPL method but a much smaller one on the AP method: AP maintains high accuracy even when the hand-crafted prompt is poor, and when the hand-crafted prompt is good, AP still outperforms the HPL model. Thus, the model learns to adjust the weights of the hand-crafted prompt and the adaptive prompt to return better results.
4.4.2 Fixed-LM Prompt Tuning
To further test the ability of the seq2seq-attention part of the AP method, we removed the hand-crafted prompt in this experiment, used only seq2seq-attention to generate the prompt, and froze the parameters of the pre-trained language model. The goal of the seq2seq-attention structure is therefore to learn the embedding representation of the adaptive prompt in the pre-trained language model, so that h behaves like the embedding of a real text sequence.
We designed experiments on large-scale data (the Weibo data) and on small samples (the FewCLUE data). On the large-scale data the accuracy of the model exceeds 92%, while on the small-sample data it is only about 65%.
We attribute the good performance on large-scale data and the poor performance on small samples to the following: with sufficient samples, the seq2seq-attention structure can learn both the adaptive prompt and its embedding representation in the pre-trained model, whereas with insufficient samples it is difficult to learn both at the same time, resulting in overfitting. The embedding representation can thus only be learned with large-scale data, and this experiment shows that the model is able to learn the adaptive prompt when samples are sufficient. Therefore, to verify that the seq2seq-attention structure can learn a general embedding representation of the pre-trained model and the adaptive prompt, we conducted a migration experiment.
4.4.3 Migration Experiment
Sentiment analysis covers different fields, such as catering, e-commerce, and film. Although there are differences between these fields, they share the same underlying sentiment classification problem. The reason past models could not be used directly across fields is that the words and language structures used for emotional expression differ greatly between fields, resulting in different parameters for the word vector layer and the fully connected layer. Therefore, a good automated prompt construction structure should be able to learn an adaptive prompt in general fields and perform well in unseen fields.
In this experiment, we set up a mixed-data experiment: the training set mixes the sentiment analysis datasets of the 7 online shopping categories (books, fruit, shampoo, water heaters, milk, clothes, and hotels) with the Weibo, takeout, and hotel data, and the e-commerce dataset (FewCLUE) is used as the test set. The learning rate is 2e-6. The result of each epoch is shown in Figure 5.


It can be seen that the model learns to construct a general adaptive prompt from the mixed fields and performs well in another field, reaching an accuracy of 88.7%, much higher than the results in Section 4.4.1 and than other methods on the FewCLUE dataset, as shown in Table 3.
Table 3: Accuracy comparison with other few-shot methods on the FewCLUE dataset.

| Method | FineTuning | PET | LM-BFF | P-tuning | EFL | AP |
|---|---|---|---|---|---|---|
| Accuracy | 65.4%* | 86.7%* | 85.6%* | 88.3%* | 84.9%* | 88.7% |
4.4.4 Pre-train Experiment
Pre-training is an application of transfer learning. It uses almost unlimited text to learn a context-sensitive representation of each token of the input sentence, implicitly learning general syntactic and semantic knowledge, and transfers the knowledge learned from the open domain to downstream tasks to improve low-resource tasks. We hope to learn a general representation of the adaptive prompt from large-scale sentiment analysis datasets through pre-training, so as to better solve sentiment classification with small samples. In this experiment, we pre-train the model on the sentiment analysis datasets of the Weibo, takeout, hotel, and online shopping fields, and fine-tune it in the movie field (32 training samples and 600 test samples) with a learning rate of 5e-6. The experimental results are shown in Table 4.
Table 4: Accuracy on the English movie-review few-shot task with the pre-trained adaptive prompt (Pre-AP).

| PET | AP | Pre-AP |
|---|---|---|
| 67.8% (avg) | 69.8% (avg) | 78.2% |
The results show that the accuracy of the pre-trained model is much higher than that of the PET and AP models, which indicates that the pre-training strategy is effective for the AP model.
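A schematic of this two-stage procedure is sketched below. The `train_epoch` helper, the data loaders, and the epoch counts are assumptions for illustration; only the data sources and learning rates come from the experiments above.

```python
def pretrain_then_finetune(model, prompt_layer, high_resource_loader, few_shot_loader,
                           train_epoch, pretrain_epochs=3, finetune_epochs=10):
    # Stage 1: learn a general adaptive prompt on mixed high-resource sentiment data
    # (Weibo, takeout, hotel, and online-shopping reviews).
    for _ in range(pretrain_epochs):
        train_epoch(model, prompt_layer, high_resource_loader, lr=1e-5)
    # Stage 2: fine-tune on the low-resource target domain (e.g., 32 movie reviews)
    # with the smaller learning rate used in Section 4.4.4.
    for _ in range(finetune_epochs):
        train_epoch(model, prompt_layer, few_shot_loader, lr=5e-6)
    return model, prompt_layer
```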
5 Conclusion
This paper reviews sentiment analysis methods, analyzes the shortcomings of prompt learning, and proposes the adaptive prompt model. The advantages of the model can be summarized as follows:
1. The hand-crafted prompt and the automated prompt are combined in the model.
2. A seq2seq-attention structure is introduced to make full use of context information when generating the adaptive prompt.
3. The proposed AP model learns to construct a general adaptive prompt using sentiment analysis datasets from fields with sufficient samples.
4. A pre-trained prompt method for the field of sentiment analysis is proposed.
Future research will proceed along the following directions to further improve the performance of the model: 1) find better parameter fine-tuning methods based on the pre-trained prompt; 2) extend this method to other natural language processing tasks, such as text classification and machine reading comprehension.
References
- (1) Wiebe J, Bruce R, O'Hara T. Development and use of a gold standard data set for subjectivity classifications. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), 1999.
- (2) Tong R M. An operational system for detecting and tracking opinions in on-line discussion. Proceedings of the ACM SIGIR Workshop on Operational Text Classification, 2001.
- (3) Xue Y, Li Q, Jin L, et al. Detecting Adolescent Psychological Pressures from Micro-Blog. International Conference on Health Information Science. Springer, Cham, 2014.
- (4) Lafferty J, McCallum A, Pereira F C N. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of ICML, 2001.
- (5) Li G, Lin Z, Wang H, et al. A Discriminative Approach to Sentiment Classification. Neural Processing Letters, 2020, 51(2).
- (6) LeCun Y, Bengio Y, Hinton G. Deep learning. Nature, 2015, 521(7553): 436-444.
- (7) Li D, Sun L, Xu X, et al. BLSTM and CNN Stacking Architecture for Speech Emotion Recognition. Neural Processing Letters, 2021(1).
- (8) Chen J, Yu J, Zhao S, et al. User's Review Habits Enhanced Hierarchical Neural Network for Document-Level Sentiment Classification. Neural Processing Letters, 2021(2).
- (9) Sadr H, Pedram M M, Teshnehlab M. A Robust Sentiment Analysis Method Based on Sequential Combination of Convolutional and Recursive Neural Networks. Neural Processing Letters, 2019, 50(6).
- (10) Le Q V, Mikolov T. Distributed Representations of Sentences and Documents. Proceedings of ICML, 2014.
- (11) Pennington J, Socher R, Manning C. GloVe: Global Vectors for Word Representation. Proceedings of EMNLP, 2014.
- (12) Peters M E, Neumann M, Iyyer M, et al. Deep contextualized word representations. Proceedings of NAACL-HLT, 2018.
- (13) Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018.
- (14) Liu Y, Ott M, Goyal N, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. 2019.
- (15) Lan Z, Chen M, Goodman S, et al. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. 2020.
- (16) Zhang Z, Han X, Liu Z, et al. ERNIE: Enhanced Language Representation with Informative Entities. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
- (17) Farahani M, Gharachorloo M, Farahani M, et al. ParsBERT: Transformer-based Model for Persian Language Understanding. arXiv, 2020.
- (18) Cui Y, Che W, Liu T, et al. Pre-Training with Whole Word Masking for Chinese BERT. 2019.
- (19) Schick T, Schütze H. Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference. 2020.
- (20) Jiang Z, Xu F F, Araki J, et al. How Can We Know What Language Models Know? 2019.
- (21) Feldman J, Davison J, Rush A M. Commonsense Knowledge Mining from Pretrained Models. arXiv, 2019.
- (22) Liu X, Zheng Y, Du Z, et al. GPT Understands, Too. 2021.
- (23) Liu P, Yuan W, Fu J, et al. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. 2021.
- (24) Xu L, Lu X, Yuan C, et al. FewCLUE: A Chinese Few-shot Learning Evaluation Benchmark. 2021.