
Appendix - Structured Self-Attention Weights
Encode Semantics in Sentiment Analysis

Zhengxuan Wu1, Thanh-Son Nguyen2, Desmond C. Ong2,3
1Symbolic Systems Program, Stanford University
2Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore
3Department of Information Systems and Analytics, National University of Singapore
[email protected], [email protected],
[email protected]

1 Evaluation Metrics

Concordance Correlation Coefficient (CCC; Lin, 1989):

The CCC of vectors $X$ and $Y$ is:

\begin{align}
\text{CCC}(X,Y) &\equiv \frac{2\,\text{Corr}(X,Y)\,\sigma_X \sigma_Y}{\sigma_X^2 + \sigma_Y^2 + (\mu_X - \mu_Y)^2} \tag{1}
\end{align}

where $\text{Corr}(X,Y) \equiv \text{cov}(X,Y)/(\sigma_X \sigma_Y)$ is the Pearson correlation coefficient, and $\mu$ and $\sigma$ denote the mean and standard deviation, respectively.
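For reference, a minimal NumPy sketch of Eq. (1). It uses the identity $2\,\text{Corr}(X,Y)\,\sigma_X\sigma_Y = 2\,\text{cov}(X,Y)$ to simplify the numerator; the function name and the choice of population statistics are our assumptions, not prescribed by the paper.

import numpy as np

def ccc(x, y):
    """Concordance Correlation Coefficient (Lin, 1989) between two 1-D arrays."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()            # population variances
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()  # population covariance
    # CCC = 2 * cov(X, Y) / (var_X + var_Y + (mu_X - mu_Y)^2)
    return 2 * cov_xy / (var_x + var_y + (mu_x - mu_y) ** 2)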

2 Experiment Setup

Computing Infrastructure:

To train our models, we use a single Standard NV6 instance on Microsoft Azure. The instance is equipped with a single NVIDIA Tesla M60 GPU.

Average Runtime:

With this computing infrastructure, it takes about 1.5 hours to train both models, with each model trained for 200 epochs. Both models reach their maximum performance in about 1.5 hours, at around 100 epochs.

Number of Trainable Parameters:

The model trained on SST-5 that uses an LSTM decoder has 3,993,222 parameters. The model trained on SEND that uses an MLP decoder has 4,715,362 parameters.
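Assuming a PyTorch implementation (the framework is not restated here), trainable-parameter counts of this kind can be reproduced with a generic helper:

def count_trainable_params(model) -> int:
    # Sum the number of elements over all parameters that require gradients.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)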

3 Task-specific Decoders

Long Short-Term Memory Network (LSTM):

For the time-series task, we use an LSTM layer (Hochreiter and Schmidhuber, 1997) to decode the context vector $c_i$ from our encoder for each window $i$ into a hidden vector $h_i$. The hidden vector then passes through an MLP to make the valence prediction:

\begin{align}
h_i &= \text{LSTM}(h_{i-1}, c_i) \tag{2} \\
\hat{r}_i &= \text{MLP}(h_i) \tag{3}
\end{align}
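A minimal PyTorch sketch of Eqs. (2)-(3), assuming an nn.LSTMCell unrolled over windows with an illustrative two-layer MLP head; all dimensions, module names, and the single scalar output per window are our assumptions rather than the paper's exact implementation.

import torch
import torch.nn as nn

class LSTMDecoder(nn.Module):
    """Decode per-window context vectors into per-window valence predictions."""

    def __init__(self, context_dim: int = 128, hidden_dim: int = 128):
        super().__init__()
        self.cell = nn.LSTMCell(context_dim, hidden_dim)  # h_i = LSTM(h_{i-1}, c_i)
        self.head = nn.Sequential(                        # r_hat_i = MLP(h_i)
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, contexts: torch.Tensor) -> torch.Tensor:
        # contexts: (batch, num_windows, context_dim)
        batch, num_windows, _ = contexts.shape
        h = torch.zeros(batch, self.cell.hidden_size, device=contexts.device)
        s = torch.zeros_like(h)  # LSTM cell state
        preds = []
        for i in range(num_windows):
            h, s = self.cell(contexts[:, i, :], (h, s))
            preds.append(self.head(h))
        return torch.stack(preds, dim=1).squeeze(-1)  # (batch, num_windows)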

Multilayer Perceptron (MLP):

Our MLP contains three consecutive linear layers with a ReLU activation between consecutive layers. For the classification task, we feed the context vector $c$ from our encoder into the MLP to make the sentiment prediction:

\begin{align}
f_1(c) &= \text{ReLU}(\mathbf{W}_1 c + \mathbf{b}_1) \tag{4} \\
\hat{r}_i &= \mathbf{W}_3\left[\text{ReLU}(\mathbf{W}_2 f_1(c) + \mathbf{b}_2)\right] + \mathbf{b}_3 \tag{5}
\end{align}

where $\mathbf{W}_1, \mathbf{W}_2, \mathbf{W}_3, \mathbf{b}_1, \mathbf{b}_2, \mathbf{b}_3$ are the learnable parameters of the linear layers. For the time-series task, the hidden vector $h_i$ is used as the input instead.
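A minimal PyTorch sketch of Eqs. (4)-(5); the input and hidden dimensions and the 5-way output (for SST-5) are illustrative assumptions, not values reported in the paper.

import torch.nn as nn

class MLPDecoder(nn.Module):
    """Three linear layers with ReLU activations between consecutive layers."""

    def __init__(self, in_dim: int = 128, hidden_dim: int = 128, out_dim: int = 5):
        super().__init__()
        self.f1 = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())      # Eq. (4)
        self.f2 = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())  # inner term of Eq. (5)
        self.f3 = nn.Linear(hidden_dim, out_dim)                               # outer linear layer of Eq. (5)

    def forward(self, c):
        # c: context vector from the encoder (classification) or hidden vector h_i (time series)
        return self.f3(self.f2(self.f1(c)))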