Appendix - Structured Self-Attention Weights Encode Semantics in Sentiment Analysis
1 Evaluation Metrics
Concordance Correlation Coefficient (CCC; Lin, 1989):
The CCC of vectors $x$ and $y$ is:

$$\rho_c = \frac{2\rho\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2} \tag{1}$$

where $\rho$ is the Pearson correlation coefficient, and $\mu$ and $\sigma$ denote the mean and standard deviation, respectively.
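For reference, Eq. (1) can be computed in a few lines of NumPy. The sketch below is ours, not the paper's; the function name `ccc` and the use of population (rather than sample) statistics are our assumptions:

```python
import numpy as np

def ccc(x: np.ndarray, y: np.ndarray) -> float:
    """Concordance Correlation Coefficient (Lin, 1989), Eq. (1).

    Uses population statistics (ddof=0), so rho * sigma_x * sigma_y
    equals the population covariance.
    """
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()              # population variances
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()    # population covariance
    return 2 * cov_xy / (var_x + var_y + (mu_x - mu_y) ** 2)
```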
2 Experiment Setup
Computing Infrastructure:
To train our models, we use a single Standard NV6 instance on Microsoft Azure. The instance is equipped with a single NVIDIA Tesla M60 GPU.
Average Runtime:
With this computing infrastructure, it takes about 1.5 hours to train both models, where each model is trained for 200 epochs. Both models reach their maximum performance in about 1.5 hours, at about 100 epochs.
Number of Trainable Parameters:
The model trained on SST-5, which uses an MLP decoder, has 3,993,222 parameters. The model trained on SEND, which uses an LSTM decoder, has 4,715,362 parameters.
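These counts can be reproduced with a standard PyTorch idiom; the helper name below is ours:

```python
import torch.nn as nn

def count_trainable_parameters(model: nn.Module) -> int:
    """Sum of element counts over all parameters with requires_grad=True."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example: a single 10 -> 5 linear layer has 10*5 weights + 5 biases = 55.
assert count_trainable_parameters(nn.Linear(10, 5)) == 55
```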
3 Task-specific Decoders
Long Short-Term Memory Network (LSTM):
For the time-series task, we use an LSTM layer (Hochreiter and Schmidhuber, 1997) to decode the context vector $c_t$ from our encoder for each window $t$ into a hidden vector $h_t$. The hidden vector then passes through an MLP to make the valence prediction $\hat{v}_t$:

$$h_t = \mathrm{LSTM}(c_t, h_{t-1}) \tag{2}$$

$$\hat{v}_t = \mathrm{MLP}(h_t) \tag{3}$$
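A minimal PyTorch sketch of this decoder, under our own assumptions (the class name, the dimensions, and the scalar valence output are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn

class LSTMDecoder(nn.Module):
    """Sketch of the time-series decoder: an LSTM over per-window context
    vectors, followed by an MLP head predicting valence (Eqs. 2-3)."""

    def __init__(self, context_dim: int = 256, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(context_dim, hidden_dim, batch_first=True)
        self.mlp = nn.Sequential(           # three linear layers, ReLU between
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),       # scalar valence per window
        )

    def forward(self, contexts: torch.Tensor) -> torch.Tensor:
        # contexts: (batch, num_windows, context_dim)
        hidden, _ = self.lstm(contexts)     # h_t for every window t (Eq. 2)
        return self.mlp(hidden).squeeze(-1) # valence predictions (Eq. 3)
```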
Multilayer Perceptron (MLP):
Our MLP contains three consecutive linear layers with a ReLU activation between consecutive layers. For the classification task, we feed the context vector $c$ from our encoder into the MLP to make the sentiment prediction $\hat{y}$:

$$z = W_3\,\mathrm{ReLU}\big(W_2\,\mathrm{ReLU}(W_1 c + b_1) + b_2\big) + b_3 \tag{4}$$

$$\hat{y} = \mathrm{softmax}(z) \tag{5}$$

where $W_i$ and $b_i$ are learnable parameters of the linear layers. For the time-series task, the hidden vector $h_t$ is the input instead.
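A corresponding PyTorch sketch, again under our own assumptions (the dimensions and the 5-class output for SST-5 are illustrative):

```python
import torch
import torch.nn as nn

class MLPDecoder(nn.Module):
    """Sketch of the classification decoder: three consecutive linear
    layers with ReLU activations in between (Eqs. 4-5)."""

    def __init__(self, context_dim: int = 256, hidden_dim: int = 128,
                 num_classes: int = 5):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(context_dim, hidden_dim),   # W_1, b_1
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),    # W_2, b_2
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),   # W_3, b_3
        )

    def forward(self, c: torch.Tensor) -> torch.Tensor:
        # Returns logits z (Eq. 4); the softmax of Eq. (5) is typically
        # folded into nn.CrossEntropyLoss during training.
        return self.layers(c)
```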