
Appendix - Structured Self-Attention Weights
Encode Semantics in Sentiment Analysis

Zhengxuan Wu1, Thanh-Son Nguyen2, Desmond C. Ong2,3
1Symbolic Systems Program, Stanford University
2Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore
3Department of Information Systems and Analytics, National University of Singapore
[email protected], [email protected],
[email protected]

1 Evaluation Metrics

Concordance Correlation Coefficient (CCC; Lin, 1989):

The CCC of vectors $X$ and $Y$ is:

\begin{align}
\text{CCC}(X,Y) &\equiv \frac{2\,\text{Corr}(X,Y)\,\sigma_X \sigma_Y}{\sigma_X^2 + \sigma_Y^2 + (\mu_X - \mu_Y)^2} \tag{1}
\end{align}

where $\text{Corr}(X,Y) \equiv \text{cov}(X,Y)/(\sigma_X \sigma_Y)$ is the Pearson correlation coefficient, and $\mu$ and $\sigma$ denote the mean and standard deviation, respectively.
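For reference, a minimal NumPy sketch of Eq. (1). It uses the identity $2\,\text{Corr}(X,Y)\,\sigma_X\sigma_Y = 2\,\text{cov}(X,Y)$ to simplify the numerator; the function name and the choice of population statistics are our assumptions, not prescribed by the paper.

import numpy as np

def ccc(x, y):
    """Concordance Correlation Coefficient (Lin, 1989) between two 1-D arrays."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()            # population variances
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()  # population covariance
    # CCC = 2 * cov(X, Y) / (var_X + var_Y + (mu_X - mu_Y)^2)
    return 2 * cov_xy / (var_x + var_y + (mu_x - mu_y) ** 2)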

2 Experiment Setup

Computing Infrastructure:

To train our models, we use a single Standard NV6 instance on Microsoft Azure. The instance is equipped with a single NVIDIA Tesla M60 GPU.

Average Runtime:

With this computing infrastructure, it takes about 1.5 hours to train both models, with each model trained for 200 epochs. Both models reach their maximum performance in about 1.5 hours, at around 100 epochs.

Number of Trainable Parameters:

The model trained on SST-5 that uses an LSTM decoder has 3,993,222 parameters. The model trained on SEND that uses an MLP decoder has 4,715,362 parameters.
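Assuming a PyTorch implementation (the framework is not restated here), trainable-parameter counts of this kind can be reproduced with a generic helper:

def count_trainable_params(model) -> int:
    # Sum the number of elements over all parameters that require gradients.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)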

3 Task-specific Decoders

Long Short-Term Memory Network (LSTM):

For the time-series task, we use an LSTM layer (Hochreiter and Schmidhuber, 1997) to decode the context vector $c_i$ from our encoder for each window $i$ into a hidden vector $h_i$. The hidden vector then passes through an MLP to make the valence prediction:

\begin{align}
h_i &= \text{LSTM}(h_{i-1}, c_i) \tag{2} \\
\hat{r}_i &= \text{MLP}(h_i) \tag{3}
\end{align}
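A minimal PyTorch sketch of Eqs. (2)-(3), assuming an nn.LSTMCell unrolled over windows with an illustrative two-layer MLP head; all dimensions, module names, and the single scalar output per window are our assumptions rather than the paper's exact implementation.

import torch
import torch.nn as nn

class LSTMDecoder(nn.Module):
    """Decode per-window context vectors into per-window valence predictions."""

    def __init__(self, context_dim: int = 128, hidden_dim: int = 128):
        super().__init__()
        self.cell = nn.LSTMCell(context_dim, hidden_dim)  # h_i = LSTM(h_{i-1}, c_i)
        self.head = nn.Sequential(                        # r_hat_i = MLP(h_i)
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, contexts: torch.Tensor) -> torch.Tensor:
        # contexts: (batch, num_windows, context_dim)
        batch, num_windows, _ = contexts.shape
        h = torch.zeros(batch, self.cell.hidden_size, device=contexts.device)
        s = torch.zeros_like(h)  # LSTM cell state
        preds = []
        for i in range(num_windows):
            h, s = self.cell(contexts[:, i, :], (h, s))
            preds.append(self.head(h))
        return torch.stack(preds, dim=1).squeeze(-1)  # (batch, num_windows)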

Multilayer Perceptron (MLP):

Our MLP contains three consecutive linear layers with a ReLU activation between consecutive layers. For the classification task, we feed the context vector $c$ from our encoder into the MLP to make the sentiment prediction:

\begin{align}
f_1(c) &= \text{ReLU}(\mathbf{W}_1 c + \mathbf{b}_1) \tag{4} \\
\hat{r}_i &= \mathbf{W}_3\left[\text{ReLU}(\mathbf{W}_2 f_1(c) + \mathbf{b}_2)\right] + \mathbf{b}_3 \tag{5}
\end{align}

where $\mathbf{W}_1, \mathbf{W}_2, \mathbf{W}_3, \mathbf{b}_1, \mathbf{b}_2, \mathbf{b}_3$ are the learnable parameters of the linear layers. For the time-series task, the hidden vector $h_i$ is used as the input instead.
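A minimal PyTorch sketch of Eqs. (4)-(5); the input and hidden dimensions and the 5-way output (for SST-5) are illustrative assumptions, not values reported in the paper.

import torch.nn as nn

class MLPDecoder(nn.Module):
    """Three linear layers with ReLU activations between consecutive layers."""

    def __init__(self, in_dim: int = 128, hidden_dim: int = 128, out_dim: int = 5):
        super().__init__()
        self.f1 = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())      # Eq. (4)
        self.f2 = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())  # inner term of Eq. (5)
        self.f3 = nn.Linear(hidden_dim, out_dim)                               # outer linear layer of Eq. (5)

    def forward(self, c):
        # c: context vector from the encoder (classification) or hidden vector h_i (time series)
        return self.f3(self.f2(self.f1(c)))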