Data Analysis
Similar to extracting latent variables that capture the covariance among psychological scales, topic models (for a review, see Blei, 2012) extract thematic information across text responses. Alternative approaches to analyzing text responses exist (e.g., sentiment analysis), but they are limited because they cannot disambiguate multiple word meanings, motivating the use of data-driven methods such as topic modeling to obtain finer and more nuanced representations of semantic concepts (e.g., Pennebaker et al., 2003; Kjell et al., 2019). Topic models work by modeling word usage across participant responses to find groups of words (i.e., “topics”) that commonly co-occur. One of the most common topic models is latent Dirichlet allocation (LDA; Blei et al., 2003), which has been found to generally produce more coherent topics than competing algorithms (Stevens et al., 2012). LDA is an unsupervised model, similar to a latent class model, because there is no explicit outcome or predictor in the model. To relate topics to predictors or covariates of interest, structural topic models (STM; Roberts et al., 2016) are used; these model text with latent topics while allowing the prevalence of each topic to be predicted by a set of exogenous variables. In an STM, the topic proportions are regressed on the predictors, allowing researchers to determine whether topic prevalence is affected by or associated with the predictors. All topic modeling analyses were performed using the psychtm (Wilcox, 2020), stm (Roberts et al., 2019), and DirichletReg (Maier, 2021) packages in the R statistical environment (R Core Team, 2020).
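For illustration, the sketch below shows how such a model could be fit with the stm package; the covariate names (nssi_history, dysregulation, sample) and the objects docs, vocab, and meta (produced by the pre-processing steps described at the end of this section) are placeholders rather than our actual scripts.

library(stm)

# Fit a structural topic model in which topic prevalence is allowed to vary
# with the exogenous covariates (a sketch; K is set arbitrarily here).
stm_fit <- stm(documents  = docs,
               vocab      = vocab,
               K          = 5,
               prevalence = ~ nssi_history * dysregulation * sample,
               data       = meta)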
Topical coherence (Mimno et al., 2011), topical exclusivity (Roberts et al., 2019), residual dispersion, and hold-out likelihood (using 50% of the data for training and 50% for model evaluation) were used as goodness-of-fit metrics to choose the optimal number of topics, with candidate solutions ranging from 2 to 10 topics. Coherence has been shown to correlate strongly with human ratings of topic interpretability (Mimno et al., 2011), while exclusivity measures how unique the prevalent words in each topic are to that topic. A good solution should score relatively high on both coherence and exclusivity.
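These metrics can be computed for each candidate solution with stm::searchK(), as sketched below; the hold-out arguments are only a rough stand-in for the 50/50 split described above, and the object names follow the earlier sketch.

# Compare candidate numbers of topics; searchK() reports held-out likelihood,
# residual dispersion, semantic coherence, and exclusivity for each K.
k_search <- searchK(documents    = docs,
                    vocab        = vocab,
                    K            = 2:10,
                    N            = floor(0.5 * length(docs)),  # documents partially held out
                    prevalence   = ~ nssi_history * dysregulation * sample,
                    data         = meta,
                    heldout.seed = 123)

k_search$results   # one row of fit metrics per candidate K
plot(k_search)     # diagnostic plots across K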
The stm (Roberts et al., 2019) R package approximates the relationships between predictors and topic proportions with a sequence of “one vs. all” linear regressions rather than estimating and testing a canonical generalized linear model, which is more appropriate for such nonlinear relationships. Given this, we instead used Dirichlet regression, implemented in the DirichletReg (Maier, 2021) R package, to jointly model the relationships between topic proportions and NSSI history, emotion dysregulation, and sample. The Dirichlet regression model predicted the topic proportions from the main effects of NSSI history (-0.5 = no history, 0.5 = history), emotion dysregulation (mean-centered, entered as a linear predictor), and sample (-0.5 = undergraduate, 0.5 = community); the three two-way interactions among these predictors; and the three-way interaction between NSSI history, emotion dysregulation, and sample.
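A sketch of this model is given below, assuming stm_fit$theta holds the estimated topic proportions (one row per participant) and meta contains the coded predictors; the crossed formula expands to the three main effects, the three two-way interactions, and the three-way interaction described above.

library(DirichletReg)

meta$topics <- DR_data(stm_fit$theta)   # package the topic proportions for DirichReg()

dr_fit <- DirichReg(topics ~ nssi_history * dysregulation * sample, data = meta)
summary(dr_fit)   # coefficient tests for each predictor on each topic's proportion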
To assess our final aim – whether the information accounted for by the topics was related, in part, to narrative valence – we examined the relationship between topic prevalence and narrative valence. Valence was scored using the sentimentr (v2.9.0; Rinker, 2021) R package given its ability to account for valence-shifting features, such as negation and amplification (i.e., words that modify the intensity of meaning; e.g., “really,” “hardly”). Each participant’s narrative received a valence score, where positive values indicate positive valence, negative values indicate negative valence, and zero indicates neutral valence (M = -0.01, SD = 0.37, Min = -1.18, Max = 1.13).
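As a sketch of this scoring step (with assumed data frame and column names), sentiment_by() averages sentence-level polarity, adjusted for valence shifters, within each narrative:

library(sentimentr)

valence_scores <- sentiment_by(get_sentences(narratives$text))
narratives$valence <- valence_scores$ave_sentiment   # > 0 positive, < 0 negative, 0 neutral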
Data pre-processing. Two participants (5%) in Sample 1 (undergraduate sample) had missing scores on the measure of emotion dysregulation. Scores were imputed for these participants using stochastic regression imputation (e.g., Enders, 2010) in the mice (van Buuren, 2011) R package, with NSSI history, participants’ subjective rating of the level of distress caused by the interpersonal stressor (i.e., “How upsetting or distressing was this event?”; response options 1 = not at all distressing to 10 = most upset or distressed I’ve ever been), and their interaction as predictors during imputation.
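A sketch of this imputation step is shown below; the data frame and column names are assumed, and in mice the method “norm.nob” corresponds to stochastic regression imputation.

library(mice)

# Predictors: NSSI history, distress rating, and their interaction (assumed names).
imp_vars <- data.frame(dysregulation = sample1$dysregulation,
                       nssi_history  = sample1$nssi_history,
                       distress      = sample1$distress,
                       nssi_distress = sample1$nssi_history * sample1$distress)

imp <- mice(imp_vars, m = 1, method = "norm.nob", seed = 123, printFlag = FALSE)
sample1$dysregulation <- complete(imp)$dysregulation   # replace the missing scores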
Before modeling the narrative responses, the raw text was pre-processed using standard practices in computational linguistics (e.g., Manning et al., 2008; Roberts et al., 2014) by (a) correcting misspellings; (b) removing commonly used “stop words” (e.g., the, to, a, an) and words in the question prompts, using the stop word list from the NLTK Python text mining library (Bird et al., 2009) while retaining negation words (“no,” “nor,” “not,” “don’t,” “hasn’t,” “haven’t,” “isn’t,” “shouldn’t,” “wasn’t,” “weren’t,” “won’t,” “wouldn’t”); (c) removing numbers, punctuation, and symbols; and (d) removing any words that were used fewer than five times in the entire corpus. Narrative responses throughout the semi-structured interview were concatenated into a single response for each participant, and experimenter utterances were removed. One participant in the community sample did not complete the interview and was excluded from analysis. This resulted in a total of 3,647 words in the undergraduate sample and 45,633 words in the community sample. After pre-processing, the average participant narrative length was 89 words (SD = 23, Median = 87, Min = 48, Max = 158) in the undergraduate sample and 275 words (SD = 162, Median = 246, Min = 22, Max = 1,189) in the community sample. Topic models and semantic measures were computed using unigrams (i.e., individual words rather than multi-word phrases).
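The sketch below illustrates steps (b) through (d) with stm::textProcessor() and stm::prepDocuments(); nltk_stopwords is assumed to hold the NLTK stop word list, the spelling correction in step (a) and removal of prompt words are not shown, and details such as contraction handling and the exact rare-word threshold would need to match the description above.

library(stm)

# Keep negation words by removing them from the stop word list (step b).
negations <- c("no", "nor", "not", "don't", "hasn't", "haven't", "isn't",
               "shouldn't", "wasn't", "weren't", "won't", "wouldn't")
custom_stopwords <- setdiff(nltk_stopwords, negations)

processed <- textProcessor(documents         = narratives$text,
                           metadata          = narratives,
                           removestopwords   = FALSE,             # use the custom list instead
                           customstopwords   = custom_stopwords,
                           removenumbers     = TRUE,              # step (c)
                           removepunctuation = TRUE,
                           stem              = FALSE)

# Trim rare words (step d); lower.thresh drops words by document frequency,
# so the five-use corpus threshold would be applied to raw term counts instead.
out   <- prepDocuments(processed$documents, processed$vocab, processed$meta,
                       lower.thresh = 1)
docs  <- out$documents
vocab <- out$vocab
meta  <- out$meta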