Automatic Detection of Twitter Users Who Express Chronic Stress Experiences via Supervised Machine Learning and Natural Language Processing

Americans bear a high chronic stress burden, particularly during the COVID-19 pandemic. Although social media have many strengths that complement the weaknesses of conventional stress measures such as surveys, they have rarely been utilized to detect individuals self-reporting chronic stress. Thus, this study aimed to develop and evaluate an automatic system on Twitter to identify users who have self-reported chronic stress experiences. Using the Twitter public streaming application programming interface, we collected tweets containing certain stress-related keywords (eg, "chronic," "constant," "stress") and then filtered the data using pre-defined text patterns. We manually annotated tweets as positive if they contained a self-report of chronic stress and as negative otherwise. We trained multiple classifiers and evaluated them via accuracy and F1 score. We annotated 4195 tweets (1560 positive, 2635 negative), achieving an inter-annotator agreement of 0.83 (Cohen's kappa). The classifier based on Bidirectional Encoder Representations from Transformers performed the best (accuracy of 83.6% [81.0-86.1]), outperforming the second best-performing classifier (support vector machine: 76.4% [73.5-79.3]). The past tweets of the authors of positive tweets contained useful information, including sources and health impacts of chronic stress. Our study demonstrates that users' self-reported chronic stress experiences can be automatically identified on Twitter, which has high potential for surveillance and large-scale intervention.

Stress is a significant public health problem in the United States. It permeates Americans' lives and brings about adverse health outcomes when left unmitigated. According to the Burden of Stress in America Survey conducted in 2014, half of the respondents reported having suffered major stressful experiences in the past year, whereas 26% reported having been under a "great deal" of stress over the past month. 1 Additionally, the American Psychological Association showed that the average stress levels for the past month were consistently higher (about 1.5 times) than the perceived healthy levels of stress over the course of more than a decade (2007-2019). 2 Finance, health, work, family, relationships, personal safety, discrimination, and even politics, among others, have been identified as major sources of stress in the United States. 2,3 Not surprisingly, in 2020-2021, the COVID-19 pandemic was a significant stressor to the US population. Seventy-eight percent of adult Americans expressed that the pandemic was a significant source of stress, and 67% expressed increased stress during the pandemic. 4

Chronic stress is defined as the "physiological or psychological response to a prolonged internal or external stressful event (ie, a stressor)." 5 Chronic stressors include but are not limited to neighborhood environment, financial strain, interpersonal stress, work stress and burnout, and caregiving. 6 It has been well documented that chronic or cumulative stress contributes to poor mental health, including depression and anxiety. Chronic stress also takes a toll on the body through wear and tear, resulting in dysfunctions of the cardiovascular, metabolic, endocrine, and immuno-inflammatory systems. In extreme cases, chronic stress can be fatal (eg, suicide). 7,8 Moreover, prolonged exposure to stress may lead to unhealthy behaviors (eg, smoking, substance use) as coping mechanisms. 9 The impact of chronic stress goes beyond individual health and affects the entire population, particularly through increases in healthcare costs. It is estimated that stress-related illnesses and injuries cost $300 billion every year in the United States. 10

In past studies, stress has been assessed by various methods, including interviews, questionnaires/surveys, and biomarkers, with few agreed-upon "gold standard" measures. 6 Notably, the prevalence of stress within large populations has been measured predominantly via surveys. Despite their effectiveness, surveys have several limitations for the accurate measurement of population-level stress: cost (making it difficult to scale up, resulting in small sample sizes); lack of generalizability (respondents often do not represent the target population well); emotional distress to participants caused by the recollection of painful memories; underreporting due to the stigma attached to events related to chronic stress; and reliance on retrospective recall, which often leads to underreporting of real-time stress or of changes in the state of stress over time. 11-13 Therefore, there is a critical need to develop innovative methods to investigate chronic stress that are low cost, scalable, generalizable, and minimally obtrusive so that response bias is minimized.
Social media are potentially valuable resources for studying chronic stress and its characteristics and, ultimately, for identifying targeted population subsets with chronic stress for interventions. Social media platforms have extensive reach within populations, even to hard-to-reach populations. In 2020, 72% of Americans reported using at least one social media platform. 14 Social media enable the collection of data in near real-time, non-invasively, and at low cost. The data collection does not need to depend on individuals' memories, which may pose psychological harms during the process. Instead, narrative data on social media platforms consist of postings through which users share certain experiences at the time of the experiences (often referred to as "self-reports"). They are posted voluntarily by the individuals and are often publicly available. Such public social media data can be analyzed using natural language processing and machine learning methods to automatically identify users with chronic stress expressing distress from "prolonged stressful events" 5 ; extract various information (eg, sociodemographic, health, behavioral, and geographic) from the selected user subsets; and gain insights into their patterns of chronic stress experiences. Such a capacity could provide opportunities to develop both individual- and population-level strategies for health promotion at a large scale. Because of these advantages, social media have been used for pharmacovigilance, 15 toxicovigilance, 16 and mental health topics, including suicidal ideation, 17 posttraumatic stress disorder, 18 depression, 18 and schizophrenia. 19

To date, only a handful of studies 20 have attempted to measure psychological stress from social media, and none of them specified which type of stress they targeted. It is critical to distinguish between different types of stress (eg, acute vs chronic stress) because the existing body of literature suggests possibly different neurobiological and psychosocial processes, properties, and consequences of different types of stress. 21 Being under chronic stress puts individuals at a greater risk of experiencing a greater number of, and more severe, life events and daily hassles and of reporting greater perceived stress at any given moment. 6 To the best of our knowledge, no existing study has examined the possibility of studying chronic stress using social media data via natural language processing and machine learning. The necessary first step to study chronic stress from social media is the development of computational methods that can automatically identify a cohort of users suffering from chronic stress. Considering that chronic stress affected all Americans during the pandemic and may have long-term health impacts post-pandemic, our study is timely and will enable population-level surveillance of chronic stress and consequent health problems for prevention and intervention.
We chose Twitter as our research platform due to its popularity and data availability. Twitter is a popular social media platform in the United States; as of July 2021, there were more than 70 million Twitter users in the United States. 22 Twitter users mostly engage in micro-blogging by posting short texts called tweets, limited to 280 characters. Besides tweets, Twitter users can also engage in discussion through re-tweets, replies, or messages. Each Twitter user account contains meta-data, including name, screen name, bio, or location information, which could potentially be used to understand the user's demographics and to examine health problems by demographic features. Twitter allows researchers to apply for "developer accounts," through which they can access a standard application programming interface (API) for collecting public data, albeit with certain limitations. Researchers can use the API to access tweets publicly posted by users and their public profiles. The standard API, however, limits the access and the data retrieval rate. For example, "user_timeline" can only retrieve the most recent 3200 tweets for a given user, and only 900 such requests can be made in a 15-minute window.
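To illustrate these limits, the following is a minimal sketch (not the authors' exact pipeline) of retrieving a user's recent tweets through the standard API with the tweepy library; the credentials and the screen name are hypothetical placeholders.

```python
import tweepy

# Hypothetical credentials obtained through a Twitter developer account.
auth = tweepy.OAuth1UserHandler(
    "CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET"
)
# wait_on_rate_limit pauses automatically when the 900-requests-per-15-minutes
# ceiling on user_timeline is reached.
api = tweepy.API(auth, wait_on_rate_limit=True)

# Cursor pages through user_timeline, which returns at most the most recent
# 3200 tweets for a given user.
timeline = [
    status.full_text
    for status in tweepy.Cursor(
        api.user_timeline, screen_name="example_user", tweet_mode="extended"
    ).items(3200)
]
```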
The purpose of our pilot study was to develop and evaluate an automatic system on Twitter via the Twitter API, natural language processing, and machine learning to identify tweets that contain self-reported chronic stress experiences and to collect the authors' past tweets along with other meta-data. The system's architecture is shown in Figure 1.

METHODS
This research was approved by the institutional review board of Emory University. In this work, we attempted to develop an automatic system to continuously build a virtual cohort of people with chronic stress. Although self-report of chronic stress on Twitter is subjective, as in interviews or surveys, and requires validation with more objective measures, we deemed this work our first step toward developing such a system and took an agile approach. Thus, this pilot study prioritized pipeline construction, data acquisition, classification, post-classification analysis, and user data analysis as initial validation, while work to establish a ground truth using a well-validated survey instrument is underway.

Data Collection and Text Pattern Filtering
We collected data in multiple intervals from 10/03/2019 to 03/13/2020 via the Twitter streaming API. This period spans from when the study was conceptualized to the day on which the first annotation sample was drawn. Our inclusion criterion was Twitter users who composed at least one posting that contained an expression of distress from "prolonged stressful events" 5 during the data collection period. Our preliminary examination found that certain text patterns indicate users' distress from specified (eg, finance, relationship) or unspecified sources of stress. We therefore collected tweets containing such keywords as "chronic," "constant," and "stress," among others, and then matched the retrieved tweets against pre-determined text patterns to reduce potential noise by filtering out the large number of tweets unlikely to be self-reports. These patterns include, for example, "i.*chronic.*stress," which requires the personal pronoun "I" followed by the terms "chronic" and "stress" with any number of words or characters in between. Please see Table S1 and Table S2 (http://links.lww.com/CIN/A219) in the Supplementary Digital Content for the full list of keywords and text patterns we used. From the full set of tweets (n = 152670), we randomly selected 4195 tweets for exploration and manual annotation for supervised classification.
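The following is a minimal sketch of this two-stage filtering, assuming an illustrative subset of the keywords and patterns (the full lists are in Tables S1 and S2):

```python
import re

# Illustrative subsets only; not the study's full keyword/pattern lists.
KEYWORDS = ("chronic", "constant", "stress")
PATTERNS = [re.compile(p, re.IGNORECASE) for p in (
    r"\bi\b.*chronic.*stress",   # "I ... chronic ... stress"
    r"\bi\b.*constant.*stress",  # "I ... constant ... stress"
    r"stress(es|ing)?\s+me\s+out",
)]

def is_candidate(text: str) -> bool:
    """Keep a tweet only if it contains a keyword AND a self-report pattern."""
    lowered = text.lower()
    if not any(keyword in lowered for keyword in KEYWORDS):
        return False
    return any(pattern.search(text) for pattern in PATTERNS)

print(is_candidate("I have been under chronic stress for months"))  # True
print(is_candidate("Constant stress is unhealthy."))                # False
```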

Tweet Contents and Manual Annotations
We annotated the tweets as either positive (P) or negative (N), where P indicated a report of the user's own chronic stress experiences and N indicated otherwise. The annotation process was iterative and involved several rounds. We studied the first few hundred tweets as a sample, developed the annotation guidelines, and then annotated the remaining tweets. Four research team members independently annotated the tweets. We discussed the disagreements, updated the guidelines, and re-annotated the tweets. This process was repeated four times until we reached a substantial agreement, and our final weighted average of the pairwise inter-annotator agreement (Cohen's kappa) reached 0.83 among 695 doubly annotated tweets.
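As a minimal sketch (with made-up labels), a weighted average of pairwise Cohen's kappa over doubly annotated subsets can be computed as follows:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical doubly annotated subsets: (annotator_a_labels, annotator_b_labels).
pairs = [
    (["P", "N", "P", "P"], ["P", "N", "N", "P"]),
    (["N", "N", "P"], ["N", "N", "P"]),
]

# Weight each pair's kappa by the number of tweets that pair annotated.
total = sum(len(labels_a) for labels_a, _ in pairs)
weighted_kappa = sum(
    cohen_kappa_score(labels_a, labels_b) * len(labels_a) / total
    for labels_a, labels_b in pairs
)
print(f"weighted average pairwise kappa = {weighted_kappa:.2f}")
```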
The definition and examples for each category are given as follows:

• Positive (P): Tweets that include indicators of self-report (eg, personal pronouns such as "I" or "me") together with text patterns indicating that the user is under chronic stress, including "chronic stress," "constant stress," and "stress me out." An example is, "This would make a massive impact on my life! The constant stress with money is killing me."

• Negative (N): Tweets that do not include indicators of self-report, such as tweets that lack a personal pronoun "I" (eg, "You must be under chronic stress."); tweets that are neutral statements (eg, "Constant stress is unhealthy."); tweets that are hypothetical (eg, "I can't imagine how stressful it would be if I were in his situation."); and tweets whose self-report status is unclear (eg, "constant headache from stress"). In the last example, it is not clear whether "constant" modifies "headache," "stress," or both.

Classification
We experimented with five traditional classification algorithms, namely Gaussian Naïve Bayes, support vector machine, random forest, k-nearest neighbor (KNN), and a shallow neural network, as well as two advanced classification algorithms: a recurrent neural network with bidirectional long short-term memory (BLSTM) and Bidirectional Encoder Representations from Transformers (BERT). 23 As a reference, we also established a baseline: a trivial classifier that assigned all tweets to the negative (majority) class. We used various Python packages to construct the classification models and conduct the experiments, including scikit-learn, Keras, transformers, and simpletransformers. Each tweet was first converted into a numerical vector representation. We pre-processed the tweets by lowercasing and anonymizing URLs and usernames. For the traditional classifiers, the vector representations were the normalized counts of the 20000 most frequent n-grams (contiguous sequences of words, with n ranging from 1 to 3). For the advanced classifiers, each word or character sequence was replaced with a dense numerical vector (such as a word embedding), and the vectors were then fed into the relevant algorithms for training.
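A minimal sketch of these two setups, assuming toy data and default hyper-parameters rather than the study's tuned configuration:

```python
import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import LinearSVC
from simpletransformers.classification import ClassificationModel

def preprocess(tweet: str) -> str:
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+", "URL", tweet)  # anonymize URLs
    tweet = re.sub(r"@\w+", "USERNAME", tweet)     # anonymize user mentions
    return tweet

# Toy examples; 1 = positive (self-report), 0 = negative.
texts = ["The constant stress with money is killing me",
         "Constant stress is unhealthy."]
labels = [1, 0]

# Traditional setup: normalized counts of the 20000 most frequent 1-3-grams,
# fed into a linear support vector machine.
svm = make_pipeline(
    CountVectorizer(preprocessor=preprocess, ngram_range=(1, 3),
                    max_features=20000),
    Normalizer(),
    LinearSVC(),
)
svm.fit(texts, labels)

# Advanced setup: fine-tuning a pre-trained BERT model; the "text"/"labels"
# column names are the schema simpletransformers expects.
train_df = pd.DataFrame({"text": [preprocess(t) for t in texts],
                         "labels": labels})
bert = ClassificationModel("bert", "bert-base-uncased", use_cuda=False)
bert.train_model(train_df)
```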
We split the dataset into training (70%), validation (10%), and test (20%) sets. The training set was used directly for training the classification models, whereas the validation set was used for hyper-parameter tuning. Most of the classification models have hyper-parameters that can be tuned manually, such as the number of training epochs for BLSTM and BERT, or the number of neighbors for KNN. For the advanced classifiers, we performed hyper-parameter tuning directly on the validation set because their training is slow. For the traditional classifiers, whose training is faster, we first merged the training and validation sets (ie, using 80% of the data) and performed hyper-parameter tuning using 10-fold cross-validation on this merged dataset: the data were split into 10 parts, and in each of 10 runs, nine parts were used for training while the remaining part was used for calculating the validation score, with a different part held out in each run. The final validation score was the average over the 10 runs.
The reason for such a scheme is that averaging the 10 validation scores makes the tuning more stable. The final test results were based on models trained with the optimal hyper-parameters on the merged training and validation sets. To ensure the test performance was unbiased, we calculated the test scores on the held-out test set only. The optimal hyper-parameters and technical details are provided in the section titled "Additional Details of the Machine Learning Experiments" and Table S3.
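A minimal sketch of this tuning scheme for a traditional classifier, with stand-in data and a hypothetical parameter grid:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-ins for the vectorized tweets and their labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = rng.integers(0, 2, size=200)

# Hold out 20% as the test set; the remaining 80% merges training + validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# 10-fold cross-validation over a hypothetical grid for KNN; the score used
# for tuning is the average over the 10 folds.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 11, 21]},
    cv=10,
    scoring="accuracy",
)
search.fit(X_trainval, y_trainval)

# The final, unbiased evaluation uses the held-out test set only.
print(search.best_params_, search.score(X_test, y_test))
```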
We evaluated the classification performance based on class-specific precision, recall, F1 score (the harmonic mean of precision and recall), and accuracy. These metrics are defined as follows, where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives for a given class:

precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 × precision × recall / (precision + recall)
accuracy = (TP + TN) / (TP + TN + FP + FN)

We used a bootstrapping method to estimate the 95% confidence intervals for the F1 scores and accuracies. We also performed a paired bootstrap resampling method 24 to test whether the best-performing classifier statistically outperforms the other classifiers on accuracy (P < .01).
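A minimal sketch of both procedures, assuming arrays of gold labels and model predictions:

```python
import numpy as np

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=10000, seed=0):
    """Percentile 95% confidence interval for accuracy via bootstrapping."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accuracies = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample w/ replacement
        accuracies.append((y_true[idx] == y_pred[idx]).mean())
    return np.percentile(accuracies, [2.5, 97.5])

def paired_bootstrap_p(y_true, pred_a, pred_b, n_boot=10000, seed=0):
    """Approximate p-value that classifier A's accuracy advantage over B
    on the same test set is due to chance."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    correct_a = np.asarray(pred_a) == y_true
    correct_b = np.asarray(pred_b) == y_true
    losses = 0
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if correct_a[idx].mean() - correct_b[idx].mean() <= 0:
            losses += 1
    return losses / n_boot
```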

Post-Classification Analyses
Following the classification experiments, we conducted analyses to (1) investigate the potential causes of classification errors and (2) analyze the learning curve (ie, system performance at different training data sizes). Analysis 1 was a qualitative, manual analysis: we focused on the best-performing classification algorithm and inferred the potential causes of errors by reading through the misclassified tweets. Analysis 2 was a quantitative analysis: we re-trained the algorithms on 20%, 40%, 60%, and 80% of the training data and compared the performance on the test set. The learning curve provides valuable information about how much data each algorithm requires to reach its learning capacity and whether annotating more data could further improve performance.
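A minimal sketch of the learning-curve procedure, with stand-in data and a stand-in classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # stand-in classifier

# Stand-ins for the vectorized training and test data.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(800, 20)), rng.integers(0, 2, 800)
X_test, y_test = rng.normal(size=(200, 20)), rng.integers(0, 2, 200)

# Re-train on growing fractions of the training data; score on the fixed test set.
for frac in (0.2, 0.4, 0.6, 0.8, 1.0):
    n = int(frac * len(X_train))
    clf = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    print(f"{frac:.0%} of training data -> "
          f"test accuracy {clf.score(X_test, y_test):.3f}")
```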

User Data Extraction and Exploratory Analysis on Users' Past Tweets
We used the Twitter API to extract past tweets (limited to the most recent 3200 posts by the Twitter API) from the users authoring the positive tweets in our curated chronic stress cohort. To examine the availability of additional information relevant to chronic stress beyond the expression of the users' state of chronic stress, for each user, we counted the number of tweets containing the search terms "chronic" or "constant." We then selected 10 users with high counts and manually examined their tweets containing the search terms "stress" or "pressure." Although we did not filter these tweets as rigorously as the original sample, we believe this filtering scheme is appropriate for retrieving information related to the authors' chronic stress experiences, given that the authors already had positive tweets.
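A minimal sketch of this ranking step, with a hypothetical data layout mapping each user in the positive cohort to their retrieved past tweets:

```python
import re
from collections import Counter

# Hypothetical layout: user id -> list of that user's past tweets.
past_tweets = {
    "user_1": ["Constant stress at work again", "nice weather today"],
    "user_2": ["lunch time"],
}

pattern = re.compile(r"\b(chronic|constant)\b", re.IGNORECASE)

# Count, per user, how many past tweets mention "chronic" or "constant".
counts = Counter({
    user: sum(bool(pattern.search(tweet)) for tweet in tweets)
    for user, tweets in past_tweets.items()
})

# The highest-count users (10 in the study) are selected for manual review.
print(counts.most_common(10))
```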

Annotation and Class Distributions
In total, we annotated 4195 tweets, among which 2635 (62.8%) were negative and 1560 (37.2%) were positive (self-reports of chronic stress). The weighted average of the pairwise inter-annotator agreement among 695 doubly annotated tweets was 0.83 (Cohen's kappa), which can be interpreted as substantial agreement. The data distributions for the training, validation, and test sets are given in Table 1.

Classification Results
The test F1 score and accuracy of each classifier are given in Table 2, and the precisions and recalls are provided in the Supplementary Digital Content, Table S4 (http://links.lww.com/CIN/A220). Among all the classifiers we experimented with, the BERT classifier performed the best. Based on the estimated 95% confidence intervals, BERT statistically outperforms all other classifiers on the F1 score for the positive class and on accuracy. The paired bootstrap resampling also shows that BERT statistically outperforms all other classifiers on accuracy (all with P < .00001). There were no significant performance differences among BLSTM, support vector machine, and random forest; however, these three algorithms still performed better than the shallow neural network, KNN, and Naïve Bayes. Since BERT was the best-performing classifier, we next analyzed the errors made by this classifier.

Error Analysis
The best classifier correctly labeled 438/527 negative tweets and 263/312 positive tweets. Our analysis found that the main sources of error were lack of context and the use of colloquial language, which is typical of social media text. For the tweets annotated as negative, our classifier often misclassified those containing subtle language cues as positive. Examples include (1) tweets where "constant" or "chronic" did not modify "stress" but some other health condition that may have been caused by stress (eg, "constant migraines from stress"); we annotated this type of tweet as negative because it was unclear whether both, or which one, of stress and migraines lasted for an extended time; (2) tweets describing hypothetical situations: for example, although "I'm literally always in a constant state of STRESS????" was labeled as positive by our system, it was unclear whether the tweet described the author's actual situation, given the question marks "????"; (3) tweets where "stress" meant "emphasize," for example, "...I can't stress this enough."; and (4) tweets too short to determine whether they belonged to the positive class, for instance, "<username> Constantly stressed." In this case, the individual experiencing constant stress could be the author or someone else, and we could not assume it was a self-report without additional information.
Conversely, for the same reasons, our classifier sometimes misclassified positive tweets as negative. For example, "am i constantly stressed that....? Also yes" contains the author's self-report of chronic stress. However, the text is a question followed by its answer, leading our model to mistakenly consider it a negative case. Also, "...bro is stressed out... bro is me" requires our system to first resolve the co-reference ("bro is me") and then infer that the author is stressed out. Our system was not equipped to make such inferences and misclassified the tweet.

Learning Curve
The learning curve is shown in Figure 2. We found that, with 20% of the training data, all algorithms except BLSTM and KNN performed much better than the baseline (62.8%). Remarkably, BERT trained on only 20% of the data performed better than all other classifiers trained on all the data, showing the benefit of fine-tuning a large, pre-trained model. Also, as the size of the training data grew, the performance of BLSTM and support vector machine improved drastically, whereas the improvement for random forest and BERT was moderate. In contrast, the performance of the shallow neural network, Naïve Bayes, and KNN did not improve once they reached their model capacities.

User Data Extraction and Exploratory Analysis of Users' Past Tweets
We retrieved over 700000 past tweets for the 975 users whose tweets were annotated as positive. Among these, we found that many users had multiple tweets containing the terms "pressure" or "stress," some of which also included the term "chronic" or "constant," indicating that they had been experiencing stress for an extended period. Among the 10 users reviewed, most disclosed the sources of their (chronic) stress in tweets, including work, school, interpersonal situations, or finance. Some of them even shared how the chronic stress impacted their health and caused mental health problems. Example tweets from four users are given in Table 3.

DISCUSSION
The BERT classifier achieved the best performance, with an accuracy of 83.6% and an F1 score for the positive class of 0.79. These results show that leveraging large pre-trained models along with contextual embeddings can enhance the classifier's understanding of the text and improve the classification performance. Since the performance was comparable with human performance (inter-annotator agreement of 0.83), our system is readily applicable for automatically collecting and filtering streaming data from Twitter and extracting past tweets, along with meta-data, from users who have positive tweets. Our review of 10 selected users' tweets found that users disclosed the sources of stress as well as the health impact of their chronic stress experiences. This analysis demonstrates the utility of our system: it could not only identify Twitter users who are likely under chronic stress but also provide their data, which might include useful information such as the sources of chronic stress and health consequences, for future research and intervention.
In the following subsections, we discuss directions for improvement, the potential applicability of the system, and its limitations.

Directions for Improvement
Although the performance of our classifier is comparable with the human agreement, the error analysis described earlier shows there is still room for improvement. The classifier tends to make mistakes when the language cues in the tweets are subtle. We propose several ways to help resolve this problem. First, it is possible to further refine the annotation guidelines to clarify ambiguity and improve the inter-annotator agreement, particularly on tweets requiring nuanced interpretation. For example, our annotation guidelines need to specify which code should be assigned to tweets like "I don't have resting ***** face I have a constant 'I'm stressed' * face." Technically, "constant" modifies "face," not "stressed," but the sentence seems to suggest that the author was under stress constantly. Annotation also becomes more complicated when tweets on the surface show a positive attitude toward chronic stress but actually represent sarcasm, such as "It's truly the most wonderful feeling being under constant stress." Second, it may be possible to update the training strategy to further improve the performance. One direction is to artificially enrich the training set with tweets containing the subtle language cues on which our current system made mistakes. This method could potentially help the algorithms learn better in these error-prone situations. One might also curate features targeting subtle language cues, such as negation detection (eg, "constant stress reliever"). 25 Moreover, using social media-specific pre-trained models may further improve performance, although we did not explore such possibilities in this study because the best-performing system already approached the human agreement in classification accuracy.
Third, since our objective is to discover a cohort of users potentially under chronic stress, we can improve the system by adding a second, user-level classifier that takes some of the users' past tweets as input. Since users under chronic stress might repeatedly post tweets relevant to their stress and health conditions, this method could potentially improve the overall performance of the system.

The Applicability of the System
Our study shows the utility of the system to collect massive data, in terms of both depth and breadth, for future research on chronic stress. Potential directions include identifying novel indicators of chronic stress (ie, linguistic markers); detecting and intervening with individuals with chronic stress to improve their stress-coping skills and prevent adverse health outcomes and unhealthy behaviors; and widely disseminating the tested interventions among Twitter users. With user characterization techniques using users' self-reported meta-data, 26 we can also estimate their demographic information (eg, race, sex, age, or even family status) and geolocation data, which will open another door to investigating chronic stress burdens unequally experienced across sociodemographic characteristics and residential areas. Additionally, these capacities can allow us to target a cohort with a specific demographic profile, for example, Black females, for interventions. In summary, our developed system and social media-based approach will complement the conventional methods (eg, surveys) of measuring chronic stress and of identifying and intervening with at-risk populations through a data-driven approach with a large volume of Twitter data. Thus, harnessing social media data on chronic stress can have significant public health and policy implications.

Limitations
The data limit the applicability of this work. First, our data collected from Twitter represent those who actively use Twitter and primarily use English. Therefore, they may not accurately represent the whole American population, which might introduce bias into our data and potentially affect the downstream analyses. According to the Pew Research Center, about 22% of American adults use Twitter, and a higher proportion of users are young to middle-aged. 27 This implies that chronic stress experiences from the analyzed tweets likely represent those among younger generations who use Twitter and might not be representative of older generations or those who do not actively use Twitter. On the flip side, however, the composition of Twitter users suggests that our work could be a great platform for targeting college students, young adults just stepping into the job market, or even people with school-age children, with issues including health, finance, school, work, interpersonal relationships, role strain, and politics. The younger populations on Twitter present an opportunity for early identification and intervention before their chronic stress, if not properly mitigated, accumulates and manifests as more serious health outcomes later in their lives.
Second, our data collection did not use a geolocation tool to limit the collection to US tweets. Although users can choose to provide location information when creating tweets, doing so is voluntary, and the provided information could be fake or invalid. The system (CARMEN) of Dredze et al 28 could resolve the location information of only about 22% of the tweets in a sample of over 43 million tweets. Furthermore, limiting the collection to US tweets could introduce additional bias, such as leaving out users who chose not to reveal their location. Such omission could be critical for studies on subjects that users are more comfortable discussing anonymously, including chronic stress experiences. Therefore, we did not infer geolocation in this study.
Third, tweets are generally short and lack context, which could compromise annotation accuracy because, without clear rules, annotators are more likely to interpret the same text differently. It also increases the difficulty of the machine-learning classification task, making the development of a high-performing model challenging.
Lastly, similar to an anonymous survey, self-reported information on social media is subjective. Considering that there is no clinical diagnosis or single gold standard measure for chronic stress, it is challenging to set a ground truth against which to evaluate the validity of the self-reported chronic stress among users identified by our system. As future work, we will validate our system with a survey and an objective measure (eg, a biomarker) to verify that users who self-reported chronic stress indeed show higher scores on a chronic stress scale and different biomarker distributions than users who did not. Despite these limitations, there may be no other platforms available that provide the same level of reach as social media for data collection. We believe our work is an appropriate first step toward identifying linguistic markers of chronic stress and predicting chronic stress among social media users to monitor and prevent adverse health outcomes and health behaviors.

CONCLUSION
In this work, we developed a system to collect Twitter chatter in which the authors self-reported their state of chronic stress, and we assessed the utility of the system by examining their past tweets. We showed that our system could automatically collect a cohort of Twitter users self-reporting chronic stress. Future research is warranted to validate our approach via objective measures, analyze the past and current tweets from this cohort to investigate the patterns of language used as linguistic markers of chronic stress, identify users with chronic stress with higher accuracy, and provide interventions at a large scale and a low cost.

The data and the annotation guidelines will be shared upon reasonable request to the corresponding author.