
MTLHealth: A Deep Learning System for Detecting Disturbing Content in Student Essays

Joseph Valencia
ACT, Inc.
Oregon State University
[email protected]
Erin Yao
ACT, Inc.
[email protected]
(07/31/2019)
Abstract

Essay submissions to standardized tests like the ACT occasionally include references to bullying, self-harm, violence, and other forms of disturbing content. Graders must take great care to identify cases like these and decide whether to alert authorities on behalf of students who may be in danger. There is a growing need for robust computer systems to support human decision-makers by automatically flagging potential instances of disturbing content. This paper describes MTLHealth, a disturbing content detection pipeline built around recent advances from computational linguistics, particularly Transformer networks pre-trained as language models.

1 Introduction

In the United States and many other developed countries, mental health crises are becoming more and more visible, including within the youth population. Suicides among 15-19 year-olds are on the rise (CDCMMWR, 2017). Mental health issues place a huge strain on students and their families, as well as on teachers, social workers, and public health officials.

ACT, Inc. is an educational non-profit that provides a variety of testing and learning products. It is well known for providing assessments for college admissions, K-12 subject areas, workforce skills, and social and emotional learning skills. Machine learning and natural language processing technology play a vital role in the organization’s ability to efficiently process these examinations, including in automated scoring engines that evaluate short and extended constructed response questions in partnership with human graders.

Occasionally, issues related to mental health and trauma surface in essay-based examinations. Graders who are tasked with evaluating essays sometimes notice disturbing content that could necessitate intervention from authorities.

In the same way that machine learning and natural language processing techniques help to grade essays, they can help educational organizations like ACT monitor submissions for disturbing content. Recent advances in deep learning for natural language processing (NLP) have brought great improvements in general language understanding tasks. Such systems can be adapted to recognize language indicative of mental health issues and trauma.

This paper describes a study conducted at ACT with the goal of building a robust deep learning model to automatically flag disturbing content in student responses. Based in part on previous definitions, ACT considers disturbing content to include references to: (1) attempts at or thoughts of suicide or self-harm, (2) violence or plans to harm others, (3) physical, sexual, or emotional abuse, (4) bullying or social isolation, and (5) substance abuse (Burkhardt et al., 2017) (Ormerod and Harris, 2018).

The resulting system is called MTLHealth. MTLHealth was designed to target student constructed responses, but its predictions could be useful across a wide variety of applications and textual domains, such as for monitoring social media. Note that further evaluation studies should be conducted to understand the performance of the system before it is used operationally.

2 Related Work

Prior research provided a strong foundation for the development of MTLHealth. Two areas of the computer science literature proved especially relevant: general methods in natural language understanding, and NLP applications in computational psychology. Additionally, a few papers have explored the specific problem of machine learning approaches to detecting disturbing content.

2.1 Deep Learning for Natural Language Understanding

Particularly in the period of 2018-2019, a wave of papers introduced systems that leveraged new approaches to transfer learning, a paradigm in which models trained on one type of data are re-purposed for a new task. These systems used unsupervised pre-training methods to initialize models for a variety of downstream tasks.

The power of transfer learning is apparent in its successful application to a number of challenging benchmark tests. SQuAD is one such challenge, in which the answer to a question must be highlighted within a context paragraph (Rajpurkar et al., 2018). GLUE is another benchmark, consisting of nine diverse language understanding tasks, including natural language inference, sentiment analysis, and paraphrase detection (Wang et al., 2018). On both datasets, deep learning systems have matched human baseline performance.

Two overarching trends in the literature appear to be responsible for much of the progress in language understanding. One is the use of pre-training on massive datasets using objectives similar to language modeling. The other is an increased reliance on Transformer neural network layers.

2.1.1 Language Model Pre-Training

Language modeling is the task of inferring a word in a sentence from the words that precede it. Given the text “School and work have been pretty stressful lately, I really need a BLANK”, a well-trained language model should assign high probability to appropriate words like ‘vacation’ or ‘nap’ and low probability to nonsensical concluding words like ‘dolphin’ or ‘running’.
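For reference, a standard left-to-right language model factorizes the probability of a T-word sentence into conditional next-word probabilities (a textbook formulation stated here for concreteness, not a quotation from the works cited below):

P(w_{1},\ldots,w_{T})=\prod_{t=1}^{T}P(w_{t}\mid w_{1},\ldots,w_{t-1})

Pre-training then amounts to fitting these conditional distributions on a large unlabeled corpus.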

The problem formulation of language modeling is simple, but successful performance requires a complex model of the statistical properties of text. Written language exhibits rich structure – sentences must adhere to grammatical and other syntactic rules, and the co-occurrence of words captures a good deal of their semantic relationships (Linzen et al., 2016). Additionally, language modeling is an unsupervised method, meaning it requires no tedious annotation of data by humans. Moreover, the Internet provides easy access to vast amounts of English text to train on. Because of these desirable qualities, language modeling is increasingly recognized as the ideal task for pre-training NLP models (Howard and Ruder, 2018).

Older methods in NLP produced static vector representations of words through training tasks similar to language modeling. For example, Word2Vec calculates a probability distribution along a sliding window of text to build a representation of words as dense vectors of small floating point values (Mikolov et al., 2013). Similar words have similar representations in vector space and simple arithmetic operators like addition and subtraction can often approximate simple analogies, for example:

\vec{King}-\vec{Man}+\vec{Woman}\approx\vec{Queen}
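For illustration, this kind of analogy query can be run against pre-trained Word2Vec vectors with the gensim library; the vector file named below is an illustrative assumption, not part of the original study.

    # Minimal sketch of Word2Vec analogy arithmetic using gensim.
    # The pre-trained vector file is an illustrative assumption.
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    # king - man + woman ~ queen
    print(vectors.most_similar(positive=["king", "woman"],
                               negative=["man"], topn=1))
    # 'queen' is expected to rank at or near the top.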

ELMo is one of the pioneering systems that used deep learning language model pre-training to significantly improve performance across a variety of downstream tasks (Peters et al., 2018). ELMo uses a bi-LSTM architecture in which one LSTM is trained as an ordinary left-to-right language model and another as a right-to-left language model. Each token in a sentence is represented as a function of all cell states from both LSTMs. This representation provides more context-specific information than a static word vector does.

ULMFiT is a language model approach to text classification based on ‘fine-tuning’, which consists of transferring a pre-trained network to a separate task and re-training the entire network on data for this downstream task (Howard and Ruder, 2018). It was the first model to propose a framework for generalized fine-tuning, a departure from approaches to transfer learning that exported fixed representations between tasks. It also introduced new techniques to vary learning rates and to prevent catastrophic forgetting of previously learned information by gradually unfreezing parameters.

2.1.2 Transformer Networks

The Transformer is a neural network architecture first proposed by Google Research in the influential paper “Attention is All You Need” (Vaswani et al., 2017). The motivation for the design was the prevalence of recurrent architectures like LSTMs and their relative computational inefficiency. Recurrent neural networks require one computational pass for each word in an input sentence. These must proceed in sequence, so they cannot be readily parallelized. This limitation makes recurrent networks expensive to train on large datasets.

The central innovation of Transformers is to replace recurrence with extensive use of attention mechanisms. Attention mechanisms are learned weights that allow the model to emphasize or ‘attend to’ important parts of a sequence for a particular task. They have proven very useful in applications like machine translation, in which they help to model soft alignments between words in the source and target languages. The paper introduces a technique called self-attention, which compares a sentence with itself, calculating an attention distribution relative to each token to capture the relationships between words. The Transformer is built of sub-modules that calculate multiple self-attention distributions and pass them to a feed-forward layer. These sub-modules are stacked into a large encoder-decoder network.
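As a rough sketch of the general form (an illustration, not the exact multi-head implementation of Vaswani et al.), scaled dot-product self-attention over a single sentence can be written in PyTorch as:

    import math
    import torch

    def self_attention(x, w_q, w_k, w_v):
        # x: (seq_len, d_model); w_q, w_k, w_v: learned (d_model, d_k) projections
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.transpose(0, 1) / math.sqrt(k.size(-1))  # (seq_len, seq_len)
        weights = torch.softmax(scores, dim=-1)  # one attention distribution per token
        return weights @ v  # each output position mixes information from the whole sentence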

BERT uses a large Transformer trained on a masked language model objective (Devlin et al., 2018). Random tokens are replaced with a special masking token, and the system is trained to recover the masked tokens. This allows BERT to incorporate bi-directional context into the representations that it learns. After this pre-training step, BERT is fine-tuned on specific datasets; this approach significantly advanced the state of the art on GLUE and SQuAD.
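A small illustration of the masked-token objective using pytorch-transformers follows; the example sentence and variable names are hypothetical.

    import torch
    from pytorch_transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")
    model.eval()

    tokens = tokenizer.tokenize("[CLS] school has been so [MASK] lately [SEP]")
    ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
    with torch.no_grad():
        scores = model(ids)[0]                  # (1, seq_len, vocab_size)
    mask_pos = tokens.index("[MASK]")
    top_id = scores[0, mask_pos].argmax().item()
    print(tokenizer.convert_ids_to_tokens([top_id]))  # the model's guess for the masked word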

Other Transformer-based models have since exceeded BERT’s performance, with improvements ranging from very slight to substantial gains on benchmark metrics. MT-DNN is a system that reproduced the BERT architecture and added multi-task learning, i.e. simultaneous training on several downstream tasks (Liu et al., 2019a). XLNet is another system that uses an autoregressive formulation of language modeling that removes the need for input masking (Yang et al., 2019). RoBERTa is a replication of BERT that better tunes its hyper-parameters (Liu et al., 2019b).

2.2 Computational Psychology

An emerging sub-field of research within computational psychology involves applying general NLP techniques to analyze the connection between language use and mental health. Many studies make use of textual data from web sources, particularly from Twitter and Reddit.

The Association for Computational Linguistics holds a yearly workshop on Computational Linguistics and Clinical Psychology (CLPsych). CLPsych has sponsored the creation of multiple mental health datasets and held competitions on machine learning for mental health. Past competitions have centered around efforts to triage posts on the suicide support website ReachOut.org and to predict depression and anxiety levels throughout life based on youth essays (Milne et al., 2016) (Lynn et al., 2018). The most recent competition involved predicting the degree of suicide risk on the r/SuicideWatch subreddit (Zirikly et al., 2019).

A paper by Gkotsis et al. demonstrated the ability of simple fully-connected and convolutional neural network architectures to classify Reddit posts as originating from mental health subreddits versus non-mental-health subreddits (Gkotsis et al., 2017). A study by Yates et al. used similar methods on user posts in non-mental-health subreddits to predict whether those same users would later announce a depression diagnosis on r/depression (Yates et al., 2017).

2.3 Disturbing Content Detection

Some prior work has directly explored the problem of detecting disturbing content in student responses. Earlier research at ACT produced a disturbing content pipeline based on an ensemble of multiple non-neural machine learning methods trained on selected Reddit posts (Burkhardt et al., 2017). A study by the American Institutes for Research built a classifier for a large internal dataset of constructed responses and compared the performance of several varieties of recurrent neural network architectures (Ormerod and Harris, 2018).

3 Methods

Figure 1: Pipeline Architecture for MTLHealth.

The basic architecture of MTLHealth consists of BERT models trained on various tasks related to mental health and disturbing content. To build a classifier for disturbing content, the prediction scores obtained from evaluating student responses serve as features for a logistic regression classifier. Figure 1 is an illustration of this pipeline.

3.1 Datasets

MTLHealth is trained on a variety of labeled datasets as described below.

  • Emotion Classification: ~6k tweets annotated with 11 emotions including anger, disgust, joy, sadness, and trust (Mohammad et al., 2018).

  • Toxic Comment Classification: ~150k Wikipedia comments annotated with labels including toxic, hate speech, obscene, and threat (https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

  • UK Birth Cohort Essays: ~10k essay responses from 11-year-olds in 1969 who were asked to imagine their lives at age 25. Each essay is annotated with a numerical score of its author’s depression level as measured by a psychological examination following the British Social Adjustment Guide (University College London, 2018) (Zirikly et al., 2019).

  • Subreddit Data: ~357k user posts in Reddit forums focusing on issues related to mental health, along with control posts from subreddits unrelated to mental health. Annotated with subreddit origin information. Some examples are: r/StopSelfHarm, r/SuicideWatch, r/addiction, and control. (Mental health data obtained via personal correspondence with the author of (Gkotsis et al., 2017); control posts taken from the Reddit dump at https://files.pushshift.io/reddit/comments/.)

  • ACT Constructed Responses: Consists of 102 constructed responses annotated with disturbing status: 42 flagged as disturbing (label of 1), 80 flagged as normal (label of 0).

The presence of disturbing content in student responses is a relatively rare occurrence. To bolster the MTLHealth system, non-ACT datasets were used to encourage the system to flag a wider range of disturbing responses.

The selected datasets were chosen to ensure the system is informed by data that is diverse in terms of domain (short tweets vs. long Reddit posts), age of speakers (children vs. adults), time period (1960s vs. the present day), and topic. Emotion and toxicity are somewhat distinct from mental health, but may be strongly correlated with mental health outcomes (Hu et al., 2014).

3.2 Training Setup

The non-profit AI organization HuggingFace maintains pytorch-transformers, a library for the deep learning framework PyTorch that allows users to build models initialized with pre-trained weights from Google (https://github.com/huggingface/pytorch-transformers). MTLHealth uses BERT-Base uncased, the smaller version of BERT, with 12 Transformer layers, a hidden dimension of 768, and nearly 110M total parameters (Devlin et al., 2018).
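For reference, loading the pre-trained BERT-Base uncased weights through pytorch-transformers takes only a few lines; this is a minimal sketch of the library usage, not the actual MTLHealth training code.

    import torch
    from pytorch_transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased")  # 12 layers, hidden size 768
    bert.eval()

    ids = torch.tensor([tokenizer.encode("[CLS] I really need a vacation. [SEP]")])
    with torch.no_grad():
        sequence_output, pooled_output = bert(ids)[:2]
    # sequence_output: (1, seq_len, 768); pooled_output: (1, 768), built from the [CLS] token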

Following the usage guide for BERT, MTLHealth pre-pends all input sentences with the special token ‘[CLS]’. After fine-tuning, the vector output corresponding to this token serves as the overall representation of the sentence. This pooled output is passed through multiple fully-connected layers with ReLU activations (Eq. 9) before being passed to a task-specific output layer.
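A minimal sketch of such a task head is shown below; the inner layer size and number of layers are illustrative assumptions, since the paper only fixes BERT’s 768-dimensional hidden size.

    import torch.nn as nn

    class TaskHead(nn.Module):
        # Pooled [CLS] vector -> fully-connected layers with ReLU and dropout -> task output
        def __init__(self, hidden=768, inner=256, n_outputs=2, p_drop=0.5):
            super().__init__()
            self.layers = nn.Sequential(
                nn.Linear(hidden, inner), nn.ReLU(), nn.Dropout(p_drop),
                nn.Linear(inner, inner), nn.ReLU(), nn.Dropout(p_drop),
                nn.Linear(inner, n_outputs))  # task-specific output layer

        def forward(self, pooled_cls):
            return self.layers(pooled_cls)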

Training neural networks involves selecting several hyper-parameters, described below.

The ‘learning rate’ is a multiplier for the parameter updates calculated during backpropagation. This value affects the convergence of the model: if the learning rate is set too low, training can take excessively long; if it is too high, training may be unstable and the system may never converge. MTLHealth uses lr = 2e-5.
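In the plain gradient-descent case (a textbook statement rather than anything specific to MTLHealth), the learning rate simply scales each update to the parameters \theta:

\theta\leftarrow\theta-lr\cdot\nabla_{\theta}\mathcal{L}(\theta)

The Adam optimizer described below adapts this step size per parameter, but the same intuition applies.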

The ‘maximum sequence length’ defines a maximum number of tokens beyond which input sentences are truncated. Unlike many training schemes, MTLHealth does not use a fixed batch size. Instead, it accepts a hyper-parameter of ‘maximum tokens per batch’ and attempts to batch sequences of similar sequence length to be close to this value. This scheme follows the implementation found in (Rush, 2018).
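A rough sketch of this token-budget batching scheme follows; details such as sorting by length are assumptions based on the description above rather than a transcription of the actual implementation.

    def batch_by_tokens(sequences, max_tokens_per_batch):
        # Group sequences of similar length so that each batch holds roughly
        # max_tokens_per_batch tokens (the longest sequence sets the per-item cost).
        batches, current, longest = [], [], 0
        for seq in sorted(sequences, key=len):
            longest = max(longest, len(seq))
            if current and longest * (len(current) + 1) > max_tokens_per_batch:
                batches.append(current)
                current, longest = [], len(seq)
            current.append(seq)
        if current:
            batches.append(current)
        return batches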

MTLHealth makes use of a technique called dropout, in which a percentage of each layer’s outputs is randomly set to zero during training; this helps to prevent over-fitting by discouraging units from relying too heavily on one another (Srivastava et al., 2014). The dropout probability is set at p = 0.5.

During fine-tuning for each task, all layers, including those belonging to the pre-trained model, are trained using the ‘Adam’ optimizer, a variation of the standard stochastic gradient descent algorithm that incorporates the concept of ‘momentum’ to automatically scale the learning rate (Kingma and Ba, 2014).

Training typically involves multiple passes through the data; each full pass is known as an epoch. Training is allowed to continue for a maximum number of epochs, max_epochs = 20. If a chosen task-specific evaluation metric does not improve for several consecutive epochs (patience = 5), training stops automatically.
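Put together, the pieces above (Adam, the epoch cap, and patience-based early stopping) interact roughly as in the sketch below; train_one_epoch and evaluate are hypothetical placeholders, not functions from the MTLHealth code base.

    import torch

    def fine_tune(model, train_loader, dev_loader, train_one_epoch, evaluate,
                  lr=2e-5, max_epochs=20, patience=5):
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        best_metric, epochs_without_improvement = float("-inf"), 0
        for epoch in range(max_epochs):
            train_one_epoch(model, train_loader, optimizer)  # one full pass over the data
            metric = evaluate(model, dev_loader)             # task-specific dev metric
            if metric > best_metric:
                best_metric, epochs_without_improvement = metric, 0
            else:
                epochs_without_improvement += 1
                if epochs_without_improvement >= patience:   # stop after 5 flat epochs
                    break
        return model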

Hyper-parameter values not found here are listed in the Appendix.

There are three basic problem formulations that MTLHealth is trained on: classification, multi-label classification, and regression; a short code sketch of the corresponding output and loss pairings follows the list below. A BERT model was trained for each dataset with an appropriate problem formulation: classification for the Reddit dataset, multi-label classification for the Twitter Emotion and Wikipedia Toxicity datasets, and regression for the 1969 UK Essay dataset.

  • Classification involves identifying a sentence as belonging to one of C mutually exclusive classes. The output activation function is a log-softmax (Eq. 8). Classification is optimized with the cross-entropy loss function (Eq. 11).

  • Multi-label classification involves identifying a sentence as belonging to anywhere from zero to C of C classes. Classes are not mutually exclusive, and a sentence can carry some, all, or none of the class labels. The output activation function is a sigmoid (Eq. 7). Multi-label classification is optimized with the cross-entropy loss function (Eq. 11).

  • Regression involves predicting a numerical score for a sentence. The output activation function is the identity. Regression is optimized with the mean squared-error loss function (Eq. 10).
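In PyTorch terms, the three formulations correspond roughly to the following output and loss pairings (a sketch, assuming the multi-label case folds the per-class sigmoid into a binary cross-entropy loss, consistent with Eqs. (7) and (11)):

    import torch.nn as nn

    # Classification: C mutually exclusive classes -> log-softmax + cross-entropy
    clf_loss = nn.CrossEntropyLoss()          # combines LogSoftmax (Eq. 8) with Eq. 11

    # Multi-label classification: independent sigmoid per class + binary cross-entropy
    multilabel_loss = nn.BCEWithLogitsLoss()  # sigmoid (Eq. 7) folded into the loss

    # Regression: identity output + mean squared error (Eq. 10)
    reg_loss = nn.MSELoss()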

After training, each BERT model can be applied to any input text. For regression, the output is a predicted score. For classification (including multi-label), the output is the probability of belonging to each class. In turn, these predictions serve as input features for a simple logistic regression classifier. This classifier yields a single yes or no label indicating whether the text includes disturbing content. This system is trained to classify the ACT constructed response data.

All training and experiments were conducted on a single p2.xlarge AWS instance with access to half of an NVIDIA Tesla K80 GPU with 12 GB of memory.

4 Results

Table 1: Toxic Comment Multi-label Classification
Model ROC-AUC mic-F1 mac-F1
Crusader 0.98856 – –
neongen 0.98822 – –
Adversarial 0.98805 – –
MTLHealth 0.9824 64.3 26.3
Table 2: Twitter Emotion Multi-label Classification
Model Accuracy F1-mic F1-mac
NTUA-SLP 58.8 70.1 52.8
TCS Research 58.2 69.3 53
PlusEmo2Vec 57.6 69.2 49.7
MTLHealth 56.9 69.2 51.5
Table 3: UK Essay Regression
Model Pearson R MAE
Coltekin et al. 0.467 0.968
UGent-IDLab 1 0.454 1.004
UKNLP 1 0.433 0.951
MTLHealth 0.288 0.500
Table 4: Reddit Post Classification
Acc. Recall Precision F1
84.4 84.0 85.2 84.6
Table 5: ACT Disturbing Content Classification
Features Acc. Rec. Prec. F1
UK Essays – – – –
Toxic 84.4 64.4 90.5 73.3
Emotion 89.3 81.4 89.0 83.5
Reddit 90.2 90.3 85.1 86.7
Toxic+Emotion 92.6 86.1 93.1 88.6
Redd.+Toxic+Emo. 95.1 88.6 97.8 92.3

MTLHealth was trained for multi-label classification on the Wikipedia Toxic Comment dataset. The train and test datasets were restricted to a subset of about 6k examples in order to limit training time. Table 1 compares the performance of MTLHealth with the three top models from the Kaggle leaderboard. The only metric on the leaderboard is ROC-AUC, on which MTLHealth performed comparably to the previous top submissions. Also included are macro-averaged and micro-averaged F1 for MTLHealth (Eq. 3).

MTLHealth was trained for multi-label classification on the Twitter Emotion dataset. Table 2 compares its performance with the three best performing models on Task A from the ACL SemEval 2018 Workshop (Mohammad et al., 2018). MTLHealth achieved competitive results in all three metrics, including the main contest metric of multi-label accuracy, also known as the Jaccard index (Eq. 4).

MTLHealth was trained for regression on the UK Essay Depression Score dataset. This problem first appeared as Task A of the CLPsych workshop at ACL 2018. The training and testing datasets were constructed using the original data provided by the UK Data Service, so the breakdown into train and test sets differs from that used in the contest. Table 3 compares MTLHealth with the top three models from this contest. MTLHealth performs considerably worse than the others on disattenuated Pearson R (Eq. 5), but much better on the mean absolute error metric (Eq. 6).

Table 4 lists the classification metrics for the Reddit task. Because the training dataset adds manually gathered control data to a subset of the mental health data from (Gkotsis et al., 2017), MTLHealth is not directly comparable with any prior system. As shown in the confusion matrix in Figure 2, the system achieves a high rate of correct classifications. By far the largest source of confusion is the misclassification of posts from r/SuicideWatch and r/selfharm as coming from r/depression, which can be explained by the close semantic relationship of these three subreddits.

The final disturbing content predictor for MTLHealth is a basic logistic regression model trained on the ACT constructed response data. Every entry in the constructed response set was split into chunks of 50 tokens and passed chunk by chunk into the BERT models for the Wikipedia Toxic, Twitter Emotion, and Reddit tasks. The feature representations for each constructed response were the output scores averaged over all chunks. The UK Essay model was excluded from the analysis due to its poor performance on testing data. Table 5 shows results from a 5-fold cross-validation evaluation using different combinations of the feature sets.
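The chunk-and-average feature extraction and the final logistic regression stage can be sketched as follows; the scorer callables standing in for the fine-tuned BERT models are hypothetical placeholders.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def response_features(tokens, scorers, chunk_size=50):
        # `scorers` is a list of callables, each mapping a token chunk to the output
        # probabilities of one fine-tuned BERT model (Toxic, Emotion, Reddit).
        chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
        per_chunk = [np.concatenate([score(c) for score in scorers]) for c in chunks]
        return np.mean(per_chunk, axis=0)  # average the scores over all chunks

    # X: one averaged feature vector per response; y: 1 = disturbing, 0 = normal
    # scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="f1")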

Figure 2: Confusion Matrix for Reddit Classification
Figure 3: Disturbing Content Classifier Weights

5 Discussion

Overall, the results of the study indicated that the BERT architecture adapts well to many applications related to disturbing content. On the Reddit, Wiki Toxic, and Twitter Emotion datasets, MTLHealth achieved high performance on all metrics. These features were highly useful in predicting the disturbing content status of ACT constructed responses.

5.1 Disturbing Content Feature Importance

Table 5 reports the results of an ablation study, which demonstrated the usefulness of the features extracted by the various BERT pipelines. The Reddit model produced by far the most predictive features of any single model, while the toxicity pipeline produced the least useful features. The best results were obtained using all feature sets besides the depression scores from the UK Essays. Figure 3 shows the magnitude of the feature weights learned by the final logistic regression in MTLHealth. Features like ‘suicideWatch’, ‘severe_toxic’, and ‘sadness’ received high positive weights, meaning they contribute positively to classifying a response as disturbing. Similarly, features like ‘optimism’, ‘trust’, and ‘control’ were strong negative indicators of disturbing content.

5.2 Comparison to Prior Work

The use of transfer learning and Transformer architectures allows MTLHealth to benefit from state-of-the-art methods as of 2019. The incorporation of diverse training data also makes MTLHealth unique among disturbing content detection systems in its capabilities across a variety of tasks.

MTLHealth was trained on over 200 times more text examples than the prior work at ACT (Burkhardt et al., 2017). The older system used only Reddit sources, while MTLHealth can make predictions across varying input domains and problem formulations. Whereas the old system used an ensemble of non-neural machine learning models all trained on one dataset, MTLHealth builds an ensemble that incorporates information from several different datasets into its predictions. This should predispose MTLHealth towards more generally applicable predictions, although this must be confirmed with validation on more data.

The paper from the American Institutes for Research (AIR) does not report conventional classification metrics, instead focusing on the total number of disturbing essays that are caught by choosing a fixed percentage of the essays with the highest probability of being disturbing (Ormerod and Harris, 2018). Crucially, unlike the AIR system, most of the training data for MTLHealth comes from publicly available datasets.

5.3 Future Work

The name MTLHealth refers both to its focus on mental health, and to the eventual goal of incorporating a training paradigm known as multi-task learning into the package. A good deal of literature indicates that simultaneously training a system on related tasks using shared parameters can yield better results than training separate models for each individual dataset (Hashimoto et al., 2016) (McCann et al., 2018). Future work will extend MTLHealth to an architecture based on MT-DNN (Liu et al., 2019a). In essence, each task will have a small number of output layers specific to the task, while the lower level representations will be in common for all tasks. This scheme will require the model to learn representations that are widely applicable across tasks.

Additionally, the MTLHealth software must undergo more rigorous review and testing before it is put into operation as part of the disturbing content detection protocol at ACT. Specifically, ACT should conduct more extensive empirical validation of MTLHealth predictions as additional constructed responses are annotated and transcribed. Given enough of this internal data, MTLHealth could potentially be trained to classify constructed responses in an end-to-end manner, i.e. without the intermediate feature extractors, or with the student data as part of an ensemble with the existing trained models. The organization must also consider the availability of human graders in order to decide a threshold above which a probability estimate from MTLHealth should trigger an alert.

Acknowledgments

The authors would like to acknowledge John Whitmer and Brian LaMure of ACT for providing access to essential computing resources.

References

Appendix A Appendices

A.1 Additional Hyperparameters

Model MaxTokens/Batch MaxLen
UK Essays 5000 224
Wiki Toxic 5400 324
Twitter Emotion 5400 324
Reddit 5000 224


MaxTokens/Batch is the maximum number of tokens to fit in a batch.
MaxLen is the maximum length of a text input in tokens.

A.2 Classification Metrics

For all equations, N refers to the number of samples, e refers to the base of the natural logarithm, and log is base 2.

Recall=\frac{TP}{TP+FN} (1)
Precision=\frac{TP}{TP+FP} (2)
F1=2\cdot\frac{Precision\cdot Recall}{Precision+Recall} (3)
JaccardIndex=\frac{1}{N}\sum_{i=1}^{N}\frac{TP_{i}}{TP_{i}+FP_{i}+FN_{i}} (4)

TP = True Positives
FP = False Positives
TN = True Negatives
FN = False Negatives
Subscripts i in (4) indicate that the metric is calculated on each example and averaged.

A.3 Regression Metrics

Pearson_{D}(X,Y)=\frac{\operatorname{cov}(X,Y)}{\sigma_{X}\sigma_{Y}}\cdot\frac{1}{\sqrt{0.77\cdot 0.70}} (5)
MAE=\frac{1}{N}\sum_{i=1}^{N}\left|Y_{i}-\hat{Y_{i}}\right| (6)

\sigma is the standard deviation.
\operatorname{cov} is the covariance.
Y_{i} is the predicted score for sample i.
\hat{Y_{i}} is the true score for sample i.
(5) is a ‘disattenuated’ version of Pearson correlation R that accounts for measurement error and re-scales the metric to [-1.362, 1.362], as proposed in (Lynn et al., 2018).
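For concreteness, the disattenuated correlation in (5) could be computed as follows; this is a sketch, with the 0.77 and 0.70 reliability constants taken directly from the equation above.

    import numpy as np

    def disattenuated_pearson(predicted, true, rel_a=0.77, rel_b=0.70):
        # Pearson R divided by the square root of the product of the reliabilities.
        r = np.corrcoef(predicted, true)[0, 1]
        return r / np.sqrt(rel_a * rel_b)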

A.4 Activation Functions

\mathcal{C} refers to the set of classes for a classification problem.

Sigmoid(x)=\frac{1}{1+e^{-x}} (7)
LogSoftmax(x)_{i}=\log\frac{e^{x_{i}}}{\sum_{j\in\mathcal{C}}e^{x_{j}}} (8)
ReLU(x)=\max(0,x) (9)

A.5 Loss Functions

\mathcal{C} is the set of classes.
p_{i} is the true distribution for class i.
q_{i} is the predicted distribution for class i.

MSE=\frac{1}{N}\sum_{i=1}^{N}(Y_{i}-\hat{Y_{i}})^{2} (10)
CrossEntropy=-\sum_{i\in\mathcal{C}}p_{i}\log q_{i} (11)