Studying Differential Mental Health Expressions in India

Abstract

One in seven individuals in India, the most populous country in the world, suffers from a mental health disorder. Moreover, the stigma associated with mental illness prevents individuals from seeking professional help. Reddit, a popular social media platform, provides unique affordances to individuals for self-disclosing their experiences with mental health and seeking support anonymously. Our study aims to fill the gap in the understanding of the unique mental health challenges faced by individuals in India by investigating expressions on Reddit. We compare and contrast the language of mental health expressions of individuals geolocated in India and those outside, while also using a set of posts in non-mental health subreddits as the control set. Our analyses reveal significant cross-cultural differences in language expressions used in online mental health discourse, including several markers specific to India. Our findings reveal that Indians prefer low valence negative emotions, focus on the present when describing their mental health challenges, and swear less. Further, Indians tend to express more sadness, more causation-related words, and fewer personal pronouns compared to a matched control group. These data-driven findings were validated by two practicing clinical psychologists in India to assess the prevalence of India-specific mental health-related themes. Predictive modeling on our data also suggests the need for precision machine learning models to pick up culture-specific signals from language sources. These findings have important implications for designing culturally appropriate interventions and policy changes to reduce the growing diagnosis and treatment gap for mental disorders in India.

Introduction

Over 197 million individuals in India are diagnosed with mental health disorders (sagar2020burden), a disproportionate majority of whom do not receive treatment (singh2018closing). The treatment gap for mental health disorders goes up to 95% in India, which is the highest across Asian countries and more severe compared to the gap of 78% in the US (murthy2017national; naveed2020prevalence). Furthermore, the suicide rate in India is 48% higher than the global average (srivastava2016mental). The reasons for these staggering statistics include challenges such as the stigmatization of mental health, a shortage of mental health professionals, and the small allocation of public budget to mental health workforce and well-being (meshvara2002mental; lahariya2018strengthen; krendl2020countries). These challenges also contribute to a huge diagnosis gap. Automated analyses of user-generated content could potentially enable early detection of mental health challenges and facilitate targeted assessments including support and treatment, especially in under-resourced contexts such as India, alleviating the challenges associated with traditional assessment methods (world2021ethics).

Figure 1: Study overview to investigate the specific language markers of Indian users posting in Mental Health (MH) related subreddits compared to individuals across the rest of the world (RoW).

Several prior works have illustrated the use of social media data for understanding mental health expressions and the promise of user-generated language data to be used for automated risk-screening  (chancellor2020methods; guntuku2017detecting; eichstaedt2018facebook). Reddit, in particular, provides unique affordances, such as the choice to be anonymous, for individuals to openly discuss experiences and struggles with mental health and seek support (de2014mental; boettcher2021studies). Computational analyses of Reddit posts have resulted in identifying shifts to suicidal ideation (de2016discovering), identifying depression symptoms (gaur2018let; liu2023detecting), and studying the mental health expressions of immigrants (mittal2023language), among others.

The global standard of mental health research has largely been informed by data and practices from Western populations (fabrega2001mental). We believe that examining the diagnosis and treatment gap in India requires focusing on data from individuals in India in comparison to data from other nations. The purpose of this study is to apply a cross-cultural computational approach to learn about how individuals in India struggling with mental health conditions express themselves, to better inform a clearer understanding of the cross-cultural gap and potentially inform future interventions. This paper bridges two key gaps from previous literature (discussed in more detail in the next section) by mining social media data to understand mental health expression specific to India, validating data-driven insights with clinical psychologists who work with patients in India, and examining underlying cross-cultural differences in the expressions of mental health at an individual level.

In summary, we address the overarching question, if and how the mental health challenges of Indian users on social media are different from the rest of the world, and answer the following:

  • How do the psychosocial language markers and thematic content in Reddit posts of individuals experiencing mental health challenges in India vary compared to individuals in the rest of the world?

  • How well do our data-driven insights on mental health expressions align with the experience of clinical psychologists in India?

  • Could machine learning models built on the language of individuals identify mental health concerns specific to India?

As a corollary to answering the above, we also examine whether large language models (LLMs) could be applied to effectively summarize and annotate latent themes obtained from topic modeling. Figure 1 provides an overview of the study.

Background

We discuss prior works in two key categories: 1) studies on social media to understand mental health language, and 2) studies on mental health prevalence and manifestations in India.

Social Media and Mental Health

Prior studies have demonstrated that social media data has significant predictive utility with regard to identifying conditions such as depression, anxiety, PTSD, and suicide ideation, among others. guntuku2017detecting; chancellor2020methods review the use of social media language and methods in identifying mental health conditions. While a majority of the studies used data from Western countries, two studies looked at cultural variations in mental health expressions. The first analyzed Twitter posts retrieved with specific keywords to establish key differences between “Western world” countries (the United States and the United Kingdom) and “Majority world” countries (India and South Africa) (de2017gender), and suggested that India and South Africa-based users are less likely to be candid in their posts and less likely to exhibit negative emotions in comparison to their Western counterparts. The second looked at Indian, Malaysian, and Filipino users on mental health support forums such as Talklife (pendse2019cross) and found that Indians prefer to discuss “wanting or needing friends” more than users from other countries; it also included a comparison between Indian and US users, which suggested that Indian users on some platforms are more likely to use clinical language than American users.

Mental Health in India

Prior work on mental health challenges in India has agreed upon the significance of the public mental health crisis in the country, with multiple studies citing the lack of public sector support, limited availability of psychology professionals, and cultural taboo as catalysts (khandelwal2004india; srivastava2016mental; hossain2019improving). A study applying National Mental Health survey data showed that depression and anxiety disorders had the highest contribution to Indian Disability Adjusted Life Years, suggesting that these are the most imminent mental health challenges in the country (sagar2020burden); and identified the key risk factors pertaining to the mental health crisis in India as being lead exposure, intimate partner violence, childhood sexual abuse, and bullying victimization. There have also been survey-based studies examining mental health symptoms in India in comparison to other countries (gada1982cross). Clinical Psychologists in India have been found to place greater weight on familial struggles as a barrier to mental health recovery than American psychologists, who placed greater weight on substance abuse (biswas2016cross). Moreover, somatic symptoms, hypochondriasis, anxiety, and agitation were found to be present in a significantly larger number of Indian patients than British patients (gada1982cross) when 100 depressed patients were surveyed. With respect to the mental health taboo in India, a survey of 3556 Indian respondents illustrated that 71% exhibit stigma-related responses when prompted with questions about mental health (live2018india).

Research Gap

Studies focused on India that conduct large-scale computational cross-cultural analyses of mental health are limited. Prior work by de2017gender and  pendse2019cross is the closest to this study. The grouping of Indian and South African tweets containing self-disclosure of mental health in  de2017gender poses a limitation to understanding the specific cultural nuances in each country. The particular focus in  pendse2019cross on identifying differences in clinical language across countries and the lack of a control group (discussions about non-mental health-related topics) leaves more to be explored in terms of the breadth of the psychosocial language markers and thematic content in mental health self-disclosures by Indians.

Indians are one of the largest populations suffering from mental health challenges and the third-largest country-level geographic user group on Reddit after the United States and Canada. We compare the language from the entire timelines of individuals geolocated to India and who also post in mental health subreddits with that of a coarsened-exact matched control set consisting of Indians who post in non-mental health subreddits and individuals from other (mostly Western) countries who post in mental health subreddits, and in non-mental health subreddits. Figure  1 shows an overview of the study. This provides an opportunity to obtain precision language markers associated with the mental health challenges of Indians going above and beyond the colloquial usage of terms within India and contrasting the mental health challenges of individuals outside.

Data

Reddit was used for our dataset for three reasons:

  1. It is an anonymous platform that has been established as an accessible and effective research dataset for analyses of mental health, as per previous research discussed above.

  2. India accounts for 5.2% of Reddit website traffic, ranking 4th globally after the US, Canada and the UK. In 2020, India ranked 3rd globally in number of users. This indicates that the platform can support a strong comparison between India and Rest of World (ROW) conversations surrounding mental health.

  3. Reddit ranks 47th in India today among the most popular social media apps; this ranking includes everything from dating sites to image apps and indicates that Reddit, while not as popular as Facebook or Instagram, is a strong candidate for studying Indian users’ behavior on social media.

This section describes the process of arriving at our final dataset consisting of the entire timelines of Reddit users in four groups: “Individuals geolocated in India and posting in Mental Health Subreddits”: (MH-India); “Individuals geolocated outside India, Rest of World and posting in Mental Health Subreddits”: (MH-ROW); “Individuals geolocated in India and not posting in mental health subreddits”: (Control-India); and “Individuals geolocated outside India and not posting in mental health subreddits” (Control-ROW). The first group (MH-India) serves as our focus, and the remaining as controls.

Collection

The mental health subreddits queried include depression, SuicideWatch, sad, Anxiety, opiates, bipolar, BPD and selfharm, among others obtained from prior work (sharma2018mental; saha2020understanding). The largest portion of users from this category (36.1%) were members of r/depression. A total of 3,195,310 posts and comments were obtained from these subreddits using the PushShift API (baumgartner2020pushshift).

To create a list of individuals who are Indian or otherwise, we first follow a similar subreddit-based approach by grouping users based on those who post in India-focused subreddits such as r/India, r/IndiaSpeaks, etc. (full list of subreddits mentioned in Appendix 1) and then also identify the geolocation of all users based on prior work (harrigian2018geocoding) that uses a text-based approach to geotagging users when geotags are not available. The geolocation process is described in detail in the next section. Once users’ timelines were extracted, we removed [deleted] usernames and null messages, and retained only users whose timeline of posts (but not comments) contained a minimum of 500 words. We specifically chose posts because the scope of this study was to obtain experiences with mental health challenges. The majority of our outside-India dataset contains users from the US, and all messages in the dataset were from 2019 and 2020.

Geolocation

We use the geolocation inference approach introduced by harrigian2018geocoding for Reddit users. The approach is based on a location estimation model that utilizes word usage, the frequency distribution of subreddit submissions, and the temporal posting habits of each user to determine their location. Specifically, we use the pre-trained GLOBAL inference model (https://github.com/kharrigian/smgeo/tree/master#models) to geolocate users in our dataset. After inferring the location of users in different groups, we remove users that don’t belong to the group to obtain our final dataset.

Coarsened Exact Matching

Matching is a computational approach to identify statistical twins from the control group for every sample in the focus group. This minimizes the effect of covariates when estimating causality between the outcome i.e. variation in language and one’s ethnicity. Ideally, the samples from the focus and the control group should have indiscernible covariates. However, the exact matching (rosenbaum2020modern) is difficult to achieve without dropping a large set of samples. Coarsened Exact Matching (CEM) (iacus2009cem) is a softer version of Exact Matching which stretches the matching criteria wide enough to avoid dropping samples that are similar but not an exact match.
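The idea of coarsening covariates and then matching exactly within the resulting strata can be illustrated with a minimal sketch. This is not the pipeline used in the study (which relies on the MatchIt package in R, described below); the bin widths, the field names, and the helper names `coarsen` and `cem_one_to_one` are hypothetical choices made only for illustration.

```python
from collections import defaultdict

def coarsen(value, width):
    """Map a continuous covariate to the index of its bin (assumed fixed-width bins)."""
    return int(value // width)

def cem_one_to_one(focus, control, age_width=5.0, gender_width=1.0):
    """Pair each focus user with one unused control user from the same stratum.

    focus / control: lists of dicts with keys 'user', 'age', 'gender'.
    Focus users with no control user left in their stratum are dropped.
    """
    strata = defaultdict(list)
    for c in control:
        key = (coarsen(c["age"], age_width), coarsen(c["gender"], gender_width))
        strata[key].append(c["user"])
    pairs = []
    for f in focus:
        key = (coarsen(f["age"], age_width), coarsen(f["gender"], gender_width))
        if strata[key]:
            pairs.append((f["user"], strata[key].pop(0)))
    return pairs

# Toy data: ages in years, gender as a continuous score (negative = male).
focus = [
    {"user": "f1", "age": 24.4, "gender": -1.1},
    {"user": "f2", "age": 31.0, "gender": 0.8},
    {"user": "f3", "age": 52.0, "gender": 2.0},
]
control = [
    {"user": "c1", "age": 22.0, "gender": -1.9},
    {"user": "c2", "age": 33.5, "gender": 0.2},
    {"user": "c3", "age": 18.0, "gender": 1.5},
]
pairs = cem_one_to_one(focus, control)
print(pairs)
```

Wider bins keep more focus users at the cost of less similar pairs, which is the trade-off that distinguishes coarsened from exact matching.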

Figure 2: The count of users for each country in the Rest of World control group (log scale). The large majority of users in the ROW group are geolocated to Western countries. The “Others” category contains countries with fewer than 10 users, including Belgium (9), Italy (9), Mexico (6), Malaysia (5), Romania (4), Croatia (4), UAE (2), South Africa (2), China (2), Spain (2), Greece (2), Denmark (1), Finland (1), Iceland (1), Japan (1), South Korea (1), Poland (1), Russia (1), Singapore (1), Thailand (1), Turkey (1) and Vietnam (1).

Users for the Control-ROW group were drawn from all remaining subreddits, that is, those other than the mental health subreddits listed in the Appendix. The sizes of the MH-India, MH-ROW and Control-ROW groups were capped at 1200, the number of MH-India users who were successfully geolocated, had a minimum of 500 words, and were members of both the Indian and mental health subreddits. The Control-India group was the smallest, with 930 users fitting each of the criteria (geolocation, India subreddits, non-mental health subreddits). To utilize the largest amount of data possible, we chose to work with 1200 users as a baseline for all but one of the groups.

Age and gender are well-known covariates when studying language (schwartz2013personality). To ensure the observed mental health language variation is not confounded by other attributes, we match the samples from our focus group i.e. MH-India with the samples in control groups (MH-RoW, Control-India, and Control-RoW) on covariates namely age and gender. The process of obtaining age and gender is described in Appendix 2.

We implement CEM using MatchIt package (stuart2011matchit) in R and set the distance to ‘mahalanobis’ for one-to-one matching. The quality of matching was evaluated using Standard Mean Differences and Kolmogorov-Smirnov Statistics. Details on the quality of matching are provided in Appendix 3. The distribution of age and gender in the final dataset are as follows:

  • Age: Majority of users between 13 and 36, with a mode at 24.40

  • Gender: Majority of users between -3.50 and 2.50, with a mode at -1.10 and negative numbers indicating male.

Group            # Distinct Users    # Posts
MH-India         1200                50928
Control-India    930                 69957
MH-ROW           1200                54666
Control-ROW      1200                122654
Total            4530                298205

Table 1: Number of users and posts in each of the four groups of our dataset

Table 1 shows the total number of posts and users in each of the four groups after CEM and Figure 2 shows the distribution of countries in the RoW data.

Methods

In this section, we describe the language features extracted from our final dataset, the statistical analyses used to obtain significantly correlated language markers for our focus group (MH-India), and the process used to annotate and validate the themes with the help of clinical psychologists.

Language Features

Building upon prior works that investigate mental health expressions on social media, we extract a diverse set of language features to understand the unique markers of mental health among Indians. For instance, several works have found that self-referential language (e.g., using personal pronouns such as I, me, myself) is indicative of poor mental health (tackman2019depression). Likewise, certain themes are specific to the Indian cultural context (e.g. the intense academic pressures owing to one of the highest competitive landscapes and parental involvement (verma2002school)). We extracted both closed- and open-vocabulary features to cover the breadth of the potential range of mental health challenges in our focus group.

N-grams:

We extract n-grams of length 1, 2, and 3 from posts and aggregate their counts for each username. We create a normalized bag-of-words representation for each user. To remove rare n-grams, we use point-wise mutual information (PMI) (bouma2009normalized) and retain only phrases whose PMI exceeds a threshold of 5.
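The n-gram features described above can be sketched as follows. This is an illustrative approximation rather than the exact pipeline: whitespace tokenization, normalization by the user's total n-gram count, and the restriction of PMI to bigrams with log base 2 are all assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def user_bag_of_ngrams(posts, max_n=3):
    """Relative frequency of every 1- to max_n-gram across one user's posts."""
    counts = Counter()
    for post in posts:
        tokens = post.lower().split()  # assumption: simple whitespace tokenizer
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}

def bigram_pmi(corpus_tokens):
    """PMI (log base 2) of each bigram: log2(p(w1, w2) / (p(w1) * p(w2)))."""
    uni = Counter(corpus_tokens)
    bi = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    n_uni, n_bi = sum(uni.values()), sum(bi.values())
    return {
        " ".join(pair): math.log2(
            (c / n_bi) / ((uni[pair[0]] / n_uni) * (uni[pair[1]] / n_uni))
        )
        for pair, c in bi.items()
    }

bag = user_bag_of_ngrams(["i feel sad", "i feel"])
pmi = bigram_pmi(["new", "york", "is", "big", "new", "york"])
kept = {phrase for phrase, score in pmi.items() if score > 5}  # threshold from the text
```

On a toy corpus this small no bigram clears a threshold of 5; the filter only becomes meaningful on a corpus large enough for strongly associated word pairs to have small marginal probabilities.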

Linguistic Inquiry Word Count (LIWC):

LIWC 2022 is a dictionary of word categories that capture people’s psychosocial states. The words associated with 102 pre-determined categories in LIWC are counted for each user and the count is normalized by the total number of 1-grams for each user, thereby representing each user as a vector of 102 normalized psychosocial categories.
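The normalization described above can be illustrated with a toy lexicon. The real LIWC 2022 dictionary is proprietary, much larger, and supports wildcard stems; the three categories and the name `TOY_LEXICON` below are hypothetical stand-ins used only to show the counting and normalization step.

```python
from collections import Counter

# Hypothetical toy lexicon; NOT the LIWC 2022 dictionary.
TOY_LEXICON = {
    "sadness": {"sad", "cry", "lonely"},
    "negation": {"not", "no", "never"},
    "present_focus": {"is", "am", "are"},
}

def category_vector(posts, lexicon=TOY_LEXICON):
    """Share of a user's 1-grams that fall into each dictionary category."""
    tokens = [t for post in posts for t in post.lower().split()]
    total = len(tokens)
    counts = Counter()
    for tok in tokens:
        for category, words in lexicon.items():
            if tok in words:
                counts[category] += 1
    return {c: (counts[c] / total if total else 0.0) for c in lexicon}

vec = category_vector(["I am sad", "I never cry"])
print(vec)
```

Each user is thus reduced to one fixed-length vector, which makes users with very different posting volumes comparable.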

Topics:

We use Latent Dirichlet Allocation (LDA) to discover underlying (latent) topics in users’ timeline data. Each topic represents a grouping of co-occurring words. While more recent neural topic modeling methods have been shown to have superior predictive accuracy, LDA was shown to provide qualitatively robust topics in prior work (dixon2022covid). We created three different sets of topics by varying the number of topics over [200, 500, 2000]. After analyzing the Topic Uniqueness (TU) of all three and an independent review by 3 coauthors, we selected 2000 topics to represent our dataset. TU reflects how often keywords are repeated across topics; a higher TU means that keywords are rarely repeated, indicating that the topics are diverse, which is favorable. 2000 topics generated on Reddit posts alone (excluding comments) provided the highest value for TU. Each user was scored based on the probability of mentioning each of the 2000 topics.
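A small sketch may help make TU concrete. The exact formula used in the study is not given here, so the code below assumes one common formulation: for each topic, average 1/cnt(w) over its top words, where cnt(w) is the number of topics whose top-word list contains w, and then average over topics. The function name is hypothetical.

```python
from collections import Counter

def topic_uniqueness(topics):
    """Mean over topics of mean(1 / cnt(w)) over each topic's top words.

    topics: list of top-word lists. A value of 1.0 means no word is shared
    between topics; values near 1 / len(topics) mean heavy repetition.
    """
    cnt = Counter(w for words in topics for w in set(words))
    per_topic = [
        sum(1.0 / cnt[w] for w in set(words)) / len(set(words))
        for words in topics
    ]
    return sum(per_topic) / len(per_topic)

topics = [
    ["feel", "depression", "tired"],
    ["college", "exam", "study"],
    ["feel", "anxiety", "alone"],
]
tu = topic_uniqueness(topics)
print(round(tu, 3))
```

Under this assumed definition, comparing TU across the 200-, 500-, and 2000-topic models amounts to calling the function once per model on its top-word lists.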

Figure 3: Top 25 statistically significant N-grams by effect size for both MH-India and MH-ROW. Significant at p < .001, two-tailed t-test, Benjamini-Hochberg corrected. Repeated N-grams are omitted.
MH-India
Group                  Category            Effect Size  Top Words
Affect                 Sadness             0.428        depression, sad, depressed, cry, lonely
Affect                 Negative Tone       0.324        bad, wrong, lost, hate, depression
Affect                 Anxiety             0.197        scared, fear, afraid, worried, anxious
Linguistic Dimensions  Negation            0.326        not, don’t, no, never, can’t
Linguistic Dimensions  Personal Pronouns   0.274        i, my, you, me, they
Linguistic Dimensions  Auxiliary Verbs     0.198        is, have, was, be, are
Time                   Present Focus       0.298        is, are, can, am, i’m
Physical               Health - Mental     0.290        depression, depressed, addiction, bipolar, adhd
Physical               Health - General    0.257        depression, pain, tired, sick, fat
Physical               Feeling             0.174        feel, hard, feeling, felt, pain
Physical               Illness             0.171        pain, sick, covid, painful, recovery
Cognition              Causation           0.254        how, because, make, why, since
Cognition              All-or-none         0.222        all, no, never, every, always
Cognition              Insight             0.195        how, know, feel, think, find
Social Processes       Communication       0.237        thanks, said, say, tell, talk
Social Processes       Politeness          0.195        please, thanks, hi, thank, ms
Motives                Allure              0.237        have, like, get, know, now
States                 Want                0.200        want, wanted, hope, wish, wants
Lifestyle              Work                0.200        work, job, edit, working, school
Drives                 Achievement         0.179        work, better, tried, best, able

MH-ROW
Group                  Category            Effect Size  Top Words
Physical               Substances          0.645        drunk, wine, marijuana, vape, cbd
Physical               Health - Mental     0.625        depression, depressed, addiction, bipolar, paranoid
Physical               Health - General    0.519        pain, fat, tired, depression, sick
Physical               Feeling             0.613        feel, hard, felt, feeling, cool
Linguistic Dimensions  Personal Pronouns   0.614        i, you, my, me, i’m
Linguistic Dimensions  Adverbs             0.533        so, just, about, there, when
Linguistic Dimensions  Conjunctions        0.432        and, but, as, so, or
Linguistic Dimensions  Common Verbs        0.424        is, have, was, be, are
Linguistic Dimensions  Common Adjectives   0.327        more, other, only, much, new
Linguistic Dimensions  Impersonal Pronouns 0.308        it, that, this, what, it’s
Affect                 Negative Tone       0.571        bad, wrong, lost, hit, hate
Affect                 Anxiety             0.559        fear, worried, scared, afraid, worry
Motives                Allure              0.450        have, like, out, get, time
Cognition              All-or-none         0.412        all, no, never, every, always
Cognition              Certitude           0.389        really, actually, real, completely, simply
Cognition              Insight             0.292        how, know, think, feel, find
Time                   Past Focus          0.368        was, had, been, i’ve, were
States                 Want                0.280        want, wanted, hope, wants, wish
States                 Acquire             0.278        get, got, take, getting, took
Social Referents       Friends             0.273        bf, mate, buddies, mates, ally

Table 2: Top 20 LIWC categories for both MH-India and MH-ROW along with Pearson r effect sizes and top 5 words by frequency in our dataset. All categories shown are statistically significant at p < .05, two-tailed t-test, Bonferroni corrected.

Statistical Analysis

We performed ordinary least squares regressions with each of the three language feature sets (N-grams, LIWC, and Topics) as independent variables and each of the four groups (MH-India, MH-RoW, Control-India, and Control-RoW) as a one-hot encoded dependent variable. We calculated Pearson r to measure the extent to which each feature relates to each of the groups in a one-vs-all setting. Since we explore several features simultaneously, we consider coefficients significant at p < 0.001 after Bonferroni correction.
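The one-vs-all correlation analysis can be sketched as below. This is a simplified illustration, not the authors' implementation: it computes Pearson r directly between each feature and a binary group indicator, and it approximates the two-tailed p-value with the Fisher z-transform (a normal approximation that is reasonable only for large samples) so that the example needs nothing beyond the standard library. All function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def approx_two_tailed_p(r, n):
    """Approximate two-tailed p-value via the Fisher z-transform."""
    if abs(r) >= 1.0:
        return 0.0
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def one_vs_all(features, groups, target, alpha=0.001):
    """features: {name: [value per user]}; groups: [group label per user].

    Returns {name: (r, p, significant)} with a Bonferroni-adjusted threshold.
    """
    y = [1.0 if g == target else 0.0 for g in groups]
    threshold = alpha / len(features)  # Bonferroni: alpha divided by number of tests
    results = {}
    for name, x in features.items():
        r = pearson_r(x, y)
        p = approx_two_tailed_p(r, len(y))
        results[name] = (r, p, p < threshold)
    return results

groups = ["MH-India"] * 4 + ["MH-RoW"] * 4
features = {"sad": [3, 4, 3, 4, 1, 0, 1, 0], "flat": [1, 1, 1, 1, 1, 1, 1, 1]}
res = one_vs_all(features, groups, target="MH-India")
```

With thousands of features, the Bonferroni-adjusted threshold becomes very strict, which is why only a small fraction of N-grams and topics survive as significant markers.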

Annotating Significant Topics

In this paper, thematic annotations were generated using ChatGPT (OpenAI_ChatGPT) for each significant LDA topic for our focus group (MH-India) based on top words. To obtain thematic annotations, we prompt ChatGPT with the following instructions: “In each row, please find the relationship between words and conclude a topic with one short phrase. Examples:
feel, myself, feeling, depression, anymore, hate, depressed, anxiety, alone, worse - struggling with loneliness and anxiety.
love, heart, loved, beautiful, happiness, miss, sad, joy, sadness, forever - mixed emotions”.
We provide two examples to the model as a few-shot prompt learning process, aiding its comprehension of our task and enabling the generation of more specific and succinct summary phrases. To enhance the quality of the generated annotations, we input 10 topics during each run and recursively generate annotations for all of them.
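The batching of topics into few-shot prompts can be sketched as follows. The instruction text and the two examples are taken from the prompt quoted above; how the rows were actually laid out and sent to ChatGPT is not specified, so the row format and the helper name `build_prompts` are assumptions, and no API call is made.

```python
INSTRUCTION = (
    "In each row, please find the relationship between words and "
    "conclude a topic with one short phrase. Examples:"
)
EXAMPLES = [
    ("feel, myself, feeling, depression, anymore, hate, depressed, anxiety, alone, worse",
     "struggling with loneliness and anxiety."),
    ("love, heart, loved, beautiful, happiness, miss, sad, joy, sadness, forever",
     "mixed emotions"),
]

def build_prompts(topics, batch_size=10):
    """Build one few-shot prompt per batch of topics.

    topics: list of top-word lists. Each prompt repeats the instruction and
    the two worked examples, followed by up to batch_size unlabeled rows.
    """
    header = "\n".join(
        [INSTRUCTION] + [f"{words} - {label}" for words, label in EXAMPLES]
    )
    prompts = []
    for start in range(0, len(topics), batch_size):
        rows = [", ".join(words) for words in topics[start:start + batch_size]]
        prompts.append(header + "\n" + "\n".join(rows))
    return prompts

prompts = build_prompts([["college", "exam", "study"]] * 23)
print(len(prompts))
```

Repeating the examples in every batch keeps each request self-contained, so a batch can be retried without depending on earlier conversation state.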

MH-India
TopicID  Top Words                                           Effect Size
1807     life, parents, family, hate, die                    0.280
730      anymore, depression, tired, depressed, everyday     0.229
334      feel, myself, feeling, depression, anymore          0.227
1531     friends, talk, social, anxiety, alone               0.226
560      love, heart, loved, beautiful, happiness            0.207
1872     said, friend, told, friends, girl                   0.180
1757     college, exam, study, university, engineering       0.176
1221     learning, learn, data, programming, science         0.152
758      anxiety, depression, mental, medication, disorder   0.148
595      sister, ma, papa, clutching, plane                  0.148
270      porn, days, nofap, fap, relapse                     0.147
1736     learn, learning, books, resources, basic            0.147
1492     against, themselves, society, opinion, political    0.144
899      family, mother, mom, father, dad                    0.142
1326     job, degree, college, school, career                0.140
439      ve, don, re, ll, doesn                              0.140
832      hate, angry, rant, respect, ugly                    0.139
1642     feeling, body, feels, heart, scared                 0.133
419      relationship, together, wants, we’ve, ex            0.116
549      mind, human, universe, reality, self                0.122

MH-ROW
TopicID  Top Words                                           Effect Size
334      feel, myself, feeling, depression, anymore          0.358
501      went, didn, crying, mad, stayed                     0.340
730      anymore, depression, tired, depressed, everyday     0.338
1642     feeling, body, feels, heart, scared                 0.322
375      sick, woke, stomach, switched, asleep               0.318
851      didn, don, wasn, couldn, re                         0.308
758      anxiety, depression, mental, medication, disorder   0.303
439      ve, don, re, ll, doesn                              0.293
1412     said, didn’t, friend, told, asked                   0.287
453      dad, mom, broke, suicide, crying                    0.286
1262     buy, save, buck, cheap, bang                        0.284
1531     friends, talk, social, anxiety, alone               0.275
595      sister, ma, papa, clutching, plane                  0.273
1010     sooner, handful, crossed, figuring, span            0.262
923      anime, manga, series, watched, japanese             0.260
1747     damage, level, attack, weapon, hit                  0.231
1180     donate, donation, charity, donations, donating      0.229
1446     fucking, shit, fuck, hate, ass                      0.229
708      weeks, october, wednesday, waited, knocked          0.223
832      hate, angry, rant, respect, ugly                    0.222

Table 3: Top 20 topics and their top words by frequency for MH-India and MH-RoW. All topics shown are statistically significant at p < .001, two-tailed t-test, Bonferroni corrected.

Validation

We validated the significantly correlated topics for our focus group by showing the top words and the LLM annotations to two clinical psychologists who have significant practical experience with seeing patients and with mental healthcare in India. Specifically, we provided the following prompts:

  1. To what extent are the open-vocabulary topics that are significantly correlated with the MH-India group prevalent in Indian patients? A Likert scale of 0-5 was provided, where 5 indicates ‘Highly Prevalent’ and 0 indicates ‘Not observed at all’.

  2. Do the provided thematic labels accurately capture the meaning of topic words? The evaluators could mark Yes, No or Unsure. If No was selected, the evaluators were further prompted to suggest the correct label.

Predictive Model

We trained prediction models using Logistic Regression on each of the three feature sets with 10-fold cross-validation (sampled such that users in the training set are not present in the test set) to learn which feature set best predicts group membership. We report the Area Under the Receiver Operating Characteristic Curve (AUC) for each of the features and groups (MH-India, MH-RoW, Control-India, Control-RoW) to compare performance.
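Two pieces of this evaluation setup, the user-disjoint fold assignment and the AUC metric, can be sketched without any external library. The classifier itself is omitted; the round-robin fold assignment and the function names are assumptions made for illustration and need not match how the folds were actually sampled.

```python
def user_disjoint_folds(user_ids, k=10):
    """Assign every distinct user to exactly one of k folds.

    Returns one fold index per row, so all rows of a user share a fold and
    no user appears in both the training and the test split of any fold.
    """
    users = sorted(set(user_ids))
    fold_of = {u: i % k for i, u in enumerate(users)}
    return [fold_of[u] for u in user_ids]

def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

folds = user_disjoint_folds(["a", "b", "a", "c"], k=2)
score = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

In a one-vs-all setting, `labels` would mark membership in one group (for example MH-India) and `scores` would be the model's predicted probabilities on held-out users.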

Results

N-grams:

Out of the 23,344 unique N-grams in our data, a total of 61 N-grams were found to be significant (p < 0.001) for the MH-India group, and 156 were significant for the MH-RoW group. Figure 3 illustrates the top 25 N-grams arranged in decreasing order of Pearson r for both groups. Personal pronouns (‘i am’, ‘i’), mentions of depression (‘depressed’, ‘depression’), and descriptions of symptoms (‘lonely’, ‘anxiety’) are prevalent in both groups, but there are also key differences. “Anxiety” does not appear among the MH-India group’s top 25 N-grams, whereas ‘lonely’ does; “anxiety” is, however, a significant marker in the MH-RoW group. Furthermore, social relationships (“parents”, “friends”) and academic-related stress (“college”, “exam”) appear in the MH-India group’s top 25 N-grams. This is particularly interesting considering that age was already used as a matching criterion, and yet discussions around student-related challenges are prevalent in expressions of mental health challenges. This suggests that student life and job stress are core aspects of the MH-India group’s experience with mental health challenges.

Moreover, “diagnosed” and “medication” appear in the MH-RoW group, but not in the MH-India group; this suggests that specific conversations around medications and diagnoses occur more in the outside India population.

LIWC

Table 2 summarizes LIWC results for the top 20 significant categories (by effect size) for each group. The set of all 102 LIWC categories along with Pearson r, p-values, and 95% confidence intervals is available in the Supplementary material. A total of 52 LIWC categories were significantly associated (p < 0.001) with the MH-India group, whereas the MH-RoW group had 60 significant categories. Both groups contained LIWC categories such as Negative Tone and Personal Pronouns, which is consistent with prior works on mental health (guntuku2017detecting). However, the top LIWC categories for the Indian group uniquely included Negations, Present Focus, Social Processes, and Causation. In contrast, those for the RoW group included Substances, Anxiety, Feelings, and Motive.

LDA Topics

Of 2000 topics, 109 were determined significant at p < 0.001 for the MH-India group and 216 were significant for the MH-RoW group. The most prevalent topics in the MH-India group discussed struggling with family, mental health, and suicidal thoughts {life, parents, family, hate, die} and academic and job stressors {college, exam, study, university, engineering}, {job, degree, college, school, career}. These topics are not contained in the RoW group, though family is mentioned in other topics {dad, mom, broke, suicide, crying}. Common topics include those with top words such as anymore, depression, tired, depressed, feel, myself, feeling, anxiety, medication. The set of all 2000 topics along with Pearson r, p-values, and 95% confidence intervals is available in the Supplementary material.

Group       N-grams   LIWC 2022   2000 Topics
MH-India    0.853     0.776       0.758
MH-RoW      0.881     0.818       0.811

Table 4: AUCs for Logistic regression models predicting group membership.

Validation by Clinical Psychologists

Of the top 20 topics significantly associated with MH-India group from our data-driven insights process, 95% were ranked either extremely or somewhat prevalent (4 or 5 on a scale of 1 - 5) in India by at least one of the two clinical psychologists and 80% were ranked prevalent by both evaluators. Of all the 109 topics significantly associated with MH-India group, 56% were annotated as being prevalent by both evaluators.

Predictive Modeling

The prediction results in Table 4 suggest that all of the language feature sets are fairly predictive, with AUC values no lower than 0.75; the highest performing feature set is N-grams (AUC = .88 for MH-RoW and .85 for MH-India). Overall, identifying mental health concerns among individuals in the RoW group is more accurate than in the India group.

Discussion

Our work reveals significant differences in the language markers of mental health of Indian users compared to those from outside India. Low valence negative emotions (‘sadness’, ‘negative affect’, and ‘negative tone’) have emerged as the dominant markers for MH-India, in contrast to feelings of anxiety and worry in RoW users. The increased association with politeness, coupled with family and achievement, indicates that users in the MH-India group regulate their feelings and tend to associate mental health with their ability and social relationships, in contrast to the increased swearing in the RoW group. The prevalent themes listed in Table 3 illustrate concerns unique to Indian users, namely academic and parental pressures. We acknowledge that data collected from social media platforms tend to favor a young demographic and that these topics could therefore be the result of skewed data. However, the matching performed across control groups ensures that a similar demographic is retained in the control group.

There is an increased use of Focus Present words by users in MH-India, which contrasts with the widespread belief that people tend to discuss past events when suffering from mental illness (park2017living). We observe this pattern in the data, but only for the MH-RoW group. The MH-India group tends to question and justify more in their language (‘causation’ words such as how, because, why), in contrast to the feelings of self-loathing in the MH-RoW group (de2017gender). Communication is another category that is seen only in the MH-India group, with an abundant presence of words such as phone, call, message, post/tweet/meme, sms/texting, and chat. This indicates a high reliance on online platforms for mental health support and privacy, potentially due to the higher stigma associated with mental health in face-to-face conversations (shidhaye2013stigma).

Almost 56% of the topics significantly associated with the MH-India group were labeled by our clinical psychologist colleagues as prevalent in Indian patients. The rest of the coherent topics revolve around Video Games/Online Content/Online Usernames, Grooming/Physical Appearance, and Programming. We speculate that these topics could be unrecognized concerns indicating unhealthy digital lifestyles, growing isolation, and low self-esteem amongst the undiagnosed young population.

Further, the quality of automated annotation using LLMs for open vocabulary topics is encouraging. Over 88% of generated labels were marked accurate by at least one of the clinical psychologists, indicating in-domain knowledge of LLMs about mental health-related concepts. Using LLMs to generate thematic summaries for LDA topics is efficient compared to traditional annotation, which requires annotators to find the relationships between keywords, a process that is time-consuming and requires expert domain knowledge. The LLM-based approach, by contrast, quickly generates annotations that agree well with human annotations.

In particular, we speculate that there are two groups of topics: 1) topics that are well documented by psychology professionals and also surfaced by the topic extraction process, and 2) topics that are not well documented by psychology professionals but are important considerations in the Indian user group. The latter are topics that are significant according to the topic extraction process but not yet noted as prevalent by psychologists in India. The first group contains 6 topics that are within our top 20 highest effect size coherent topics for the MH-India population and were labelled “Extremely Prevalent” by the psychologists. These include Topics 1531, 334, 560, 1326, 832, and 1642, which, as expected, revolve around discussions of feelings and anxiety. Two of these topics are not part of the top 20 highest effect size coherent topics in the MH-RoW group, indicating potential key differences between MH-RoW and MH-India. These are the topics titled Complex Emotions Related to Love and Education and Career Paths. The second group includes a range of incoherent topics as well as Topic 23, labelled Environmental Impact of Energy Sources, Topic 1773, labelled Humorous Content and Reactions, and what appears to be a range of pop culture references. Previous research has suggested that people (particularly young people) are increasingly climate anxious, and that humour on social media is often used as a coping mechanism for mental health challenges (Schneiderhumour; Sansonclimate). Furthermore, it is possible that certain pop culture references are crucial to understanding the narrative: as one reviewer pointed out, the top word “Singh” in Topic 837 may correspond to the suicide of the late Bollywood star Sushant Singh Rajput. This group of topics should be explored further; while they may not be marked prevalent by experienced psychologists in India, we speculate that they are emerging and important aspects of Indian mental health expressions.

Implications and Broader Impact

The expanding population, and hence the growing treatment gap for mental disorders, is a major source of concern in Indian society. The economic loss due to mental health conditions between 2012 and 2030 is estimated at USD 1.03 trillion (United Nations: https://www.who.int/india/health-topics/mental-health). Automated systems that could detect and track mental well-being could alleviate the lack of resources, but they would only be useful when designed with the cultural sensitivities and norms of society in mind. Most current mental health research using language data relies on Western-centric data. Our study sheds light on the unique challenges faced by individuals in the Global South, with a focus on India, and has important implications for designing culturally appropriate interventions to reduce the diagnosis and treatment gap for mental disorders.

Limitations

This work has several limitations. For instance, while two coauthors verified the model's geolocation estimates for a set of 100 users, the text-based geolocation used in this study could potentially label as Indian those individuals who are now expatriates in other countries. Further, the Reddit user sample is not representative of the general population, as evidenced by the mostly English language data in our India samples, although India has over 100 languages. Our work provides a glimpse of the significant cultural themes observed in Indian society; however, English linguistic markers extracted from online platforms represent only a small share of India's population (Digital 2023 India: https://datareportal.com/reports/digital-2023-india). While this analysis provides correlational insight into the data, it does not support causal claims.

Ethical Considerations

There are several ethical implications associated with the use of Reddit data and the resulting insights of this paper. While Reddit data is public, it contains personally revealing information; confidentiality and user privacy protection are of utmost importance. We therefore analyzed only posts by users who had not deleted their accounts. Our university’s Institutional Review Board deemed this study exempt due to the public nature of all data. We have exercised caution in the way we collected, processed, and presented these data to protect user privacy. An ethical limitation of this study is that we did not obtain informed consent from the individuals whose data was used. We believe this line of research, when done ethically with respect to user anonymity and privacy, could assist in understanding the mental health challenges of diverse individuals and in developing personalized interventions that improve the well-being and mental health of under-resourced communities (proferes2021studying).

Appendix A Appendix 1: List of Subreddits Used to Extract the Raw Data

Mental Health:

r/Anxiety, r/bipolar, r/BipolarReddit, r/depression, r/sad, r/SuicideWatch, r/addiction, r/opiates, r/ForeverAlone, r/BPD, r/selfharm, r/StopSelfHarm, r/OpiatesRecovery, r/Sadness, r/schizophrenia, r/AdultSelfHarm

India:

r/india, r/mumbai, r/tamil, r/Hindi, r/Kerala, r/Urdu, r/delhi, r/pune, r/hyderabad, r/bangalore, r/kolkata, r/telugu, r/marathi, r/AskIndia, r/sanskrit, r/Kochi, r/Rajasthan, r/pali, r/Chandigarh, r/Chennai, r/karnataka, r/Bhopal, r/Coimbatore, r/kannada, r/TamilNadu, r/Trivandrum, r/gujarat, r/punjabi, r/Bengali, r/kolhapur, r/Vijaywada, r/Dehradun, r/sahitya, r/Uttarakhand, r/ahmedabad, r/bharat, r/nagpur, r/Agra, r/assam, r/Indore, r/surat, r/navimumbai, r/Goa, r/sikkim, r/lucknow, r/Bareilly, r/nashik, r/Allahabad, r/Durgapur, r/Jamshedpur, r/Asansol, r/indianews, r/IndianGaming, r/IndiaSpeaks, r/indiameme, r/dankinindia, r/indiasocial

Appendix B Appendix 2: Age and Gender

We applied open-source age and gender predictive lexica (sap2014developing) to obtain continuous values of age and gender. These lexica were built from a set of over 70,000 users of social media and blogs; they predict age with a Pearson r of 0.86 and gender with an accuracy of 0.91, and have been applied reliably to Reddit data in prior studies (zirikly2019clpsych). We used the probabilities from this model to denote the gender attribute of users in our data and did not treat gender as a binary category.
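As a rough illustration of how a weighted lexicon yields a continuous estimate, the sketch below scores a user as an intercept plus the sum of word weights multiplied by each word's relative frequency in the user's text. The function name, the weights, and the intercept are hypothetical toy values chosen for illustration; they are not taken from the published lexica, and the exact preprocessing used in this study is not reproduced here.

```python
from collections import Counter


def lexicon_score(tokens, weights, intercept=0.0):
    """Weighted-lexicon estimate: intercept plus the sum of
    weight * relative frequency over words found in the lexicon."""
    if not tokens:
        return intercept
    counts = Counter(tokens)
    total = len(tokens)
    return intercept + sum(
        weights[word] * count / total
        for word, count in counts.items()
        if word in weights
    )


# Toy weights and intercept (hypothetical, illustration only).
toy_weights = {"school": -4.0, "mortgage": 6.0}
toy_intercept = 25.0
score = lexicon_score("my school exam".split(), toy_weights, toy_intercept)
```

Because the output is a real number rather than a class label, the same scheme can produce a continuous age estimate or a continuous gender score, in line with the non-binary treatment of gender described above.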

Appendix C Appendix 3: CEM Quality of Matching

A control group is considered balanced with the treatment group if the difference in covariates is close to zero, as indicated by the solid line in Figure 4. A total of 1200 users in the MH-RoW group were matched, and 1200 samples were picked from the Control-RoW group.

Figure 4 (panels a and b): Differences in Covariates before and after CEM for groups “Control-RoW” and “MH-RoW”
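One common way to quantify the covariate difference used in such balance checks is the standardized mean difference (SMD). The sketch below assumes that measure for illustration; the appendix does not state which difference statistic was plotted, and the function name and toy age values are hypothetical.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """Difference in means divided by the pooled standard deviation.
    Values close to zero suggest the covariate is balanced."""
    pooled_sd = math.sqrt((variance(treated) + variance(control)) / 2)
    if pooled_sd == 0:
        return 0.0
    return (mean(treated) - mean(control)) / pooled_sd


# Toy ages (hypothetical): a control pool before matching and a matched subset.
treated_age = [22, 24, 26]
control_age_before = [30, 34, 38]
control_age_after = [22, 25, 26]

smd_before = standardized_mean_difference(treated_age, control_age_before)
smd_after = standardized_mean_difference(treated_age, control_age_after)
```

In this toy example the absolute SMD shrinks substantially after matching, which is the pattern that a before and after comparison of covariates is meant to show.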