Data Analysis
Similar to extracting latent variables that capture the covariance among
psychological scales, topic models (for a review, see Blei, 2012)
extract thematic information across text responses. Alternative
approaches to analyzing text responses exist (e.g., sentiment analysis),
but these approaches are limited because they cannot disambiguate
multiple word meanings, motivating the use of data-driven methods such
as a topic modeling framework to obtain finer and more nuanced
representations of semantic concepts (e.g., Pennebaker et al., 2003;
Kjell et al., 2019): topic models work by modeling word usage across
participant responses in an attempt to find groups of words (i.e.,
“topics”) that commonly co-occur. One of the most common forms of topic
models is latent Dirichlet allocation (LDA; Blei et al., 2003). In
comparison to other algorithms for computing topics, LDA has generally
been found to produce more coherent topics (Stevens et al., 2012). LDA
is an unsupervised model, similar to a latent class model, because
there is no explicit outcome or predictor in the model. To relate topics
to predictors or covariates of interest, structural topic models (STM;
Roberts et al., 2016) are used, which model text with latent topics
while allowing the prevalence of each topic to be predicted by a set of
exogenous variables. In an STM, the topic proportions are regressed
on the predictors, allowing researchers to determine whether topic
prevalence is affected by or associated with the predictors. All topic
modeling analyses were performed using the psychtm (Wilcox,
2020), stm (Roberts et al., 2019), and DirichletReg (Maier, 2021) packages in the R statistical environment (R Core Team,
2020).
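For illustration, the sketch below shows how an STM with covariate-dependent topic prevalence can be fit with the stm package. The object names (a hypothetical data frame narratives holding one cleaned text response per participant together with the coded predictors) and the number of topics are placeholders rather than the exact objects used in our analyses.

    library(stm)

    # Convert the cleaned narratives into the document-term format expected by stm().
    # `narratives` is a hypothetical data frame with one row per participant:
    # columns `text`, `nssi_history`, `emo_dysreg`, and `sample`.
    processed <- textProcessor(documents = narratives$text, metadata = narratives)
    prepped   <- prepDocuments(processed$documents, processed$vocab, processed$meta)

    # Fit an STM in which topic prevalence depends on the exogenous covariates.
    stm_fit <- stm(documents  = prepped$documents,
                   vocab      = prepped$vocab,
                   K          = 5,                     # placeholder number of topics
                   prevalence = ~ nssi_history * emo_dysreg * sample,
                   data       = prepped$meta,
                   seed       = 123)

    labelTopics(stm_fit)   # most probable and most exclusive (FREX) words per topic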
Topical coherence (Mimno et al., 2011), topical exclusivity (Roberts et
al., 2019), residual dispersion, and hold-out likelihood (using 50% of
the data for training and 50% for model evaluation) were used as
goodness-of-fit metrics to choose the optimal number of topics, with
candidate solutions ranging from 2 to 10 topics. Coherence has been shown to correlate strongly with
human ratings of topic interpretability (Mimno et al., 2011), while
exclusivity provides a measure of the uniqueness of the words prevalent
in each topic. Ideally, a good solution would provide higher coherence
and higher exclusivity scores.
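As a sketch of how these metrics can be compared across candidate solutions, stm's searchK() computes held-out likelihood, residual dispersion, semantic coherence, and exclusivity over a range of K. The object names carry over from the hypothetical fitting sketch above, and the held-out settings shown are illustrative rather than the exact 50/50 split described.

    library(stm)

    # Evaluate candidate solutions with 2 to 10 topics on the four fit metrics.
    k_search <- searchK(documents  = prepped$documents,
                        vocab      = prepped$vocab,
                        K          = 2:10,
                        prevalence = ~ nssi_history * emo_dysreg * sample,
                        data       = prepped$meta,
                        heldout.seed = 123)

    k_search$results   # held-out likelihood, residuals, coherence, exclusivity per K
    plot(k_search)     # plot the metrics against the number of topics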
The stm (Roberts et al., 2019) R package approximates the
relationships between predictors and topic proportions by a sequence of
“one vs. all” linear regressions rather than estimating and testing
with a canonical generalized linear model, which is more appropriate
for nonlinear relationships. Given this, we instead used Dirichlet
regression implemented in the DirichletReg (Maier, 2021) R
package to jointly model relationships between NSSI history, emotion
dysregulation, and the two samples. The Dirichlet regression model of
the topic proportions included the main effect of NSSI history (-0.5 =
no history, 0.5 = history), the main effect of emotion dysregulation
(mean-centered, entered as a linear predictor), the main effect of
sample (-0.5 = undergraduate, 0.5 = community), the three two-way
interactions among NSSI history, emotion dysregulation, and sample, and
the three-way interaction among NSSI history, emotion dysregulation,
and sample.
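A minimal sketch of this specification with DirichletReg is shown below, assuming a hypothetical data frame dat that holds each participant's estimated topic proportions (here, five topics) together with the coded predictors; the R formula operator * expands to the three main effects, the three two-way interactions, and the three-way interaction.

    library(DirichletReg)

    # Bundle the topic proportions into a compositional response object.
    dat$topics <- DR_data(dat[, paste0("topic", 1:5)])

    # Main effects, all two-way interactions, and the three-way interaction.
    dr_fit <- DirichReg(topics ~ nssi_history * emo_dysreg * sample, data = dat)
    summary(dr_fit)   # coefficient tests for each topic's proportion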
To assess our final aim – whether the information accounted for by the
topics was related, in part, to narrative valence – we studied the
relationship between topic prevalence and narrative valence. Valence was
scored using the sentimentr (v2.9.0; Rinker, 2021) R package
given its ability to account for valence-shifting features, such as
negation and amplification (i.e., words that modify the intensity of
meaning; e.g., “really,” “hardly”). Each participant’s narrative was
scored with respect to valence, where higher, positive values indicate
positive valence and lower, negative values indicate negative valence; a
score of zero is neutral (M = -0.01, SD = 0.37, Min = -1.18, Max = 1.13).
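A brief sketch of this scoring step with sentimentr, assuming the same hypothetical data frame narratives whose text column holds each participant's concatenated narrative:

    library(sentimentr)

    # Split each narrative into sentences, score them, and average within participant.
    valence_scores <- sentiment_by(get_sentences(narratives$text))
    narratives$valence <- valence_scores$ave_sentiment   # higher = more positive valence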
Data pre-processing. Two participants (5%) in Sample 1
(undergraduate sample) had missing scores on the measure of emotion
dysregulation. Scores were imputed for these participants using
stochastic regression imputation (e.g., Enders, 2010) in the mice (van
Buuren, 2011) R package, using NSSI history, participants’ subjective
rating of the level of distress caused by the interpersonal stressor
(i.e., “How upsetting or distressing was this event?”; response
options 1 = not at all distressing to 10 = most upset or
distressed I’ve ever been), and their interaction as predictors during
imputation.
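The sketch below illustrates this imputation step with mice, using method = "norm.nob" (stochastic regression imputation) and a single imputed data set; the variable names are placeholders, and the interaction is entered as a computed column so that it is available as a predictor.

    library(mice)

    # Hypothetical Sample 1 data frame `sample1` with columns:
    # `ders` (emotion dysregulation; 2 missing), `nssi_history`, `distress_rating`.
    imp_dat <- data.frame(
      ders            = sample1$ders,
      nssi_history    = sample1$nssi_history,
      distress        = sample1$distress_rating,
      nssi_x_distress = sample1$nssi_history * sample1$distress_rating
    )

    # Stochastic regression imputation ("norm.nob"), one imputed data set.
    imp <- mice(imp_dat, m = 1, method = "norm.nob", seed = 123, printFlag = FALSE)
    sample1$ders <- complete(imp)$ders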
Before modeling the narrative responses, the raw text was pre-processed
using standard practices in computational linguistics (e.g., Manning et
al., 2008; Roberts et al., 2014) by (a) correcting misspellings; (b)
removing commonly used “stop words” (e.g., the, to, a, an) and words in
the question prompts, using the stop word list from the NLTK Python
text mining library (Bird et al., 2009) but retaining negation words
(“no”, “nor”, “not”, “don’t”, “hasn’t”, “haven’t”, “isn’t”,
“shouldn’t”, “wasn’t”, “weren’t”, “won’t”, “wouldn’t”); (c) removing
numbers, punctuation, and symbols; and (d) removing any words that were
used fewer than five times in the entire corpus. Narrative responses
throughout the semi-structured interview were concatenated into a single
response for each participant and experimenter utterances were removed.
One participant in the community sample did not complete the interview
and was excluded from analysis. This resulted in a total of 3,647 words
in the undergraduate sample and 45,633 words in the community sample.
After pre-processing, the average participant narrative length was 89
words (SD = 23, Median = 87, Min = 48, Max =
158) in the undergraduate sample and 275 words (SD = 162, Median = 246, Min = 22, Max = 1189) in the
community sample. Topic models and semantic measures were computed using
unigrams (i.e., individual words rather than multi-word phrases).
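For reference, the stm helpers used in the fitting sketch above can approximate most of these pre-processing steps. The snippet below is a sketch under assumed object names: the stop word list shown stands in for the NLTK list, the prompt words are placeholders, and prepDocuments() thresholds on document frequency rather than on raw corpus counts, so lower.thresh only approximates the five-occurrence rule described above.

    library(stm)

    # Retained negation words and a stand-in stop word / prompt word list.
    negation_words <- c("no", "nor", "not", "don't", "hasn't", "haven't", "isn't",
                        "shouldn't", "wasn't", "weren't", "won't", "wouldn't")
    prompt_words   <- c("upsetting", "distressing", "event")   # placeholder prompt words
    custom_stops   <- setdiff(c(stopwords::stopwords("en"), prompt_words), negation_words)

    processed <- textProcessor(documents         = narratives$text,
                               metadata          = narratives,
                               lowercase         = TRUE,
                               removenumbers     = TRUE,
                               removepunctuation = TRUE,
                               removestopwords   = FALSE,        # use the custom list instead
                               customstopwords   = custom_stops,
                               stem              = FALSE)

    # Drop rare terms: lower.thresh = 4 removes words appearing in four or fewer documents.
    prepped <- prepDocuments(processed$documents, processed$vocab, processed$meta,
                             lower.thresh = 4)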