Figure 1. Conceptual overview of how the topic modeling analysis
feeds into the structured futuring process, including worldbuilding and
story creation.
2 Data and Methods
2.1 News article corpus on the future of the Arctic
We collected news articles from multiple Arctic regional news sources.
These articles were available in publicly accessible, English-language
Arctic newspapers, specifically: The Arctic Sounder, Arctic Today, The
Barents Observer, CBC North, The Moscow Times, Nunatsiaq News, and Radio
Canada International. These sources were selected based on a set of
preliminary conversations with community leaders, scientific experts and
public officials from throughout the Arctic. We used the Google Search
engine for discovery, with a temporal search window of 2010-2020. These
dates do not correspond to any specific event; rather, they capture a
recent, contemporary set of published perspectives on the future of the
Arctic.
For most sources, we simply used the search term ‘future’ as a filter
of the articles, given the publication itself was an ‘arctic’
publication. For Radio Canada International and The Moscow Times, we
used both ‘arctic’ and ‘future’ as a filter. While the language of the
sources was restricted to English-language texts, there are news
resources coming from the entire pan-Arctic region, including Russia,
Finland, Sweden, Norway, Iceland, Canada, Greenland, and Alaska. The
purpose for this broad collection is to ensure that the information from
the regional (i.e. spatially extensive, less granular, more general) and
the local (i.e. spatially specific, more detailed, deeper knowledge),
spans the possibility space of a large fraction of the Arctic discourse
about the future, available in English-language newspapers. While
additional words could have been used, such as ‘projection’, ‘forecast’,
or ‘scenario’, we were intentional about using a simple, straightforward
— and hopefully repeatable — procedure. We rely on the word ‘future’
as our primary search term for multiple reasons. First, ‘future’ is
commonly used as both a noun and an adjective. As a noun, ‘future’ is
defined as “the time that will come after the present or the events that
will happen then” (OED Online, 2021). Likewise, as an adjective,
‘future’ is defined as “That is to be, or will be, hereafter.” These
definitions capture precisely the meaning we are after. Second, very few
nouns in the English language are commonly used to convey this meaning;
less common options include “hereafter” and “tomorrow”. There are some
synonyms of ‘future’ as an adjective, though these tend to be less
precise (e.g., ‘anticipated’, ‘expected’). Thus, the simple choice of
‘future’ allows us to target this definition of the word with a
straightforward approach.
While the Google Search algorithm often displays tens of thousands of
search returns, a user only has access to approximately 300 entries.
Thus, we limit the collection of articles to the top 300 articles
returned from each source. Google orders search results using its
proprietary ranking algorithms, including PageRank, which ranks webpages
based in part on the number of links to them elsewhere on the internet.
This will inevitably lead to some skew in the types of articles that are
listed at the top, and we anticipate that future work could use
alternative search approaches that would potentially yield different
sets of articles. That is, however, beyond the scope of this work.
Ultimately, by using Google Search we ensure the method is free and
generally easy for others to use, without any fee-based licenses.
The potential for biases exists in multiple aspects of this corpus.
First, we only use English-language texts (or texts that were translated
into English). As a result, there is an anglophone bias that may
include implicit or tacit perspectives that are difficult to surface,
not least related to the legacy of settler colonialism in Indigenous
communities in the Arctic. Along with this, we note that the LDA
implicitly absorbs biases of the corpus itself, including: the
vocabulary of the news article authors, the political biases of the
article authors themselves, and the editorial biases of the news sources
and their publishers. In acknowledging the various contexts that
constrain the creation of these source articles (D’Ignazio and Klein,
2020), we underline that this machine learning method carries its own
varieties of bias.
Second, we are implicitly adopting the biases of the search algorithm
that was used, in this case a Google-based search. While this is
certainly a bias, we intend for this method to be accessible to anyone,
a goal that excludes many scholarly search products, which often require
paid subscriptions or memberships in specialized organizations. While
other forms of written information
about the future of the Arctic region exist (such as journal articles,
reports, or other forms of information), we wanted to collect
publicly-available and currently discussed ideas about the future.
Freely available news articles serve this aim by providing publicly
available information.
Finally, we note that no member of the research team is local or
Indigenous to the Arctic, and the team is only able to interpret
English-language texts. Moreover, since the goal was to generate written
stories from computational text analysis, both the input and the output
needed to be in English. Also, while a future goal of this type of work
could include engagement of local and Indigenous communities in the
Arctic, this work aims to demonstrate the method’s potential, by using
publicly available documents written for and by an audience that exists
in, or is concerned with, the Arctic. That being said, the data should
be interpreted as originating from texts that were written from specific
positions of cultural power throughout the Arctic and near-Arctic,
rather than representative of a reality devoid of these power dynamics
(D’Ignazio and Klein, 2020).
2.2 Text preparation and conversion
Each document was saved as a plaintext file and manually stripped of
extraneous material that did not pertain to the article itself, including
strings of characters associated with unrelated website HTML (i.e.,
Hypertext Markup Language), unrelated publication text, or
advertisements. Once the corpus of texts was identified, we generated a
machine-readable corpus using Python-based scripts that can
batch-convert documents to text strings. It is important to note that
the corpus is being used for educational and research purposes only, and
that the corpus itself is not publicly distributed.
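The batch-conversion step can be sketched with standard-library Python; this is a minimal illustration, in which the directory layout, the `.txt` extension, and the function name are assumptions rather than the authors' actual scripts:

```python
from pathlib import Path

def load_corpus(corpus_dir):
    """Read every plaintext article in `corpus_dir` into a dict of strings.

    The directory name and .txt extension are illustrative assumptions;
    the `errors="replace"` fallback guards against stray non-UTF-8 bytes
    left over from web scraping.
    """
    documents = {}
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        documents[path.name] = path.read_text(encoding="utf-8",
                                              errors="replace")
    return documents
```

Each file then maps to one machine-readable string, ready for tokenization.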
2.3 Latent Dirichlet allocation (LDA)
Using the Gensim package, in addition to several other Python-based
tools, we performed the tasks of converting the strings of text into a
vectorized set of inputs for analysis, including tokenization,
lemmatization, and stop-word filtering (Řehůřek & Sojka, 2010; Sarkar,
2019). Next, we performed the latent Dirichlet allocation (LDA). LDA is
a machine-learning approach that takes a large corpus of texts and
reveals the latent (i.e., hidden) patterns of keywords and topics that
occur across the corpus. Below is a more detailed explanation of
the process, including the corresponding versions of each software
package used. We note that multiple methods of text analysis could be
suitable for identifying semantically distinct topics from a large
corpus, such as latent semantic analysis (LSA) or latent semantic
indexing (LSI). We employ LDA primarily because it produces highly
interpretable topics (Kayser and Shala, 2014) with intuitive topic
visualization options (Sievert and Shirley, 2014).
We use Python version 3.7.7 for this entire analysis. The initial step
for the LDA is to pre-process each document using the Python-based
Natural Language Toolkit (NLTK) version 3.4.4. Tokenization is the
initial step of breaking the text within each document in the corpus
into the individual units of meaning, in this case, individual words.
Stopwords are then removed, including frequently used words such as
“the”, “and”, “as”, etc. We note that we used both Gensim and NLTK
because Gensim’s tokenization allowed spurious words to persistently
pass through the filters we implemented; NLTK proved more effective at
the tokenization procedure for our specific task and was used instead.
Lemmatization is the final step in the corpus preparation, which helps
reduce the remaining words to their basic form, e.g., changing past
tense versions of a word to a common form. These steps result in a
tokenized corpus of texts. It is possible that lemmatization eliminates
potentially valuable temporal context for articles, specifically
signaling past, present, and future tenses. However, in this analysis we
wanted to maximize the set of distinct words to characterize the future,
rather than find (potentially) multiple words with the same root albeit
with different tenses. Future work could explore this question, though
it is outside the scope of this work.
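The shape of this preprocessing pipeline can be illustrated with a standard-library-only sketch. The stopword set and lemma table below are toy stand-ins, not the NLTK resources the analysis actually uses:

```python
import re

# Toy stand-ins for illustration: the analysis uses NLTK's tokenizer,
# stopword list, and lemmatizer; these minimal versions only sketch the
# shape of the pipeline.
STOPWORDS = {"the", "and", "as", "of", "in", "a", "is", "to", "that"}
LEMMAS = {"melting": "melt", "melted": "melt", "communities": "community"}

def preprocess(text):
    """Tokenize, remove stopwords, and reduce words to a basic form."""
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]    # stopword removal
    return [LEMMAS.get(t, t) for t in tokens]             # lemmatization

print(preprocess("The melting of the ice is changing Arctic communities."))
# → ['melt', 'ice', 'changing', 'arctic', 'community']
```

Each document becomes a list of tokens, which is the bag-of-words input the LDA expects.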
Next, the Gensim Python package (version 3.8.0) is used for the LDA,
which is a method that iteratively identifies the latent topic structure
across the corpus. This is completed by repeatedly evaluating sets of
words, and learning which clusterings lead to coherent, distinct,
topics. There are several parameters that can be adjusted, but the most
consequential for our work is the number of topics that are being sought
in the analysis. We performed a sensitivity analysis, varying the number
of topics into which the corpus was clustered, and calculated the
resulting Coherence score for each number of clusters. Coherence
measures the degree of semantic similarity among the top terms in each
topic, which helps distinguish statistical artefacts from actual
semantic relatedness; a higher Coherence score implies greater semantic
similarity among the terms in each topic cluster. There are several
metrics for Coherence, and based on a systematic review of various
coherence metrics, we employ the C_v metric (Röder et al.,
2015). For our purposes, we aimed for each topic to be internally
semantically similar enough to support a coherent storytelling process;
thus, we selected the number of topics with the highest Coherence score.
2.4 Visualize LDA results and identify scenario seeds
Using the pyLDAvis package with Gensim (Sievert and Shirley, 2014), we
visualized the Intertopic Distance Map, based on a principal component
analysis calculated within pyLDAvis. Additionally, we show the 30 most
relevant terms for each topic. The Intertopic Distance Map was leveraged
as a quadrant space (i.e., the four sections created by the two axes)
and we labeled the axes to provide scenario context based on exogenous
drivers; this is further elaborated in Section 3.3 below. The process of
creating a quadrant space to construct scenarios is a central feature of
many scenario analyses (Raven, 2014; Raven and Elahi, 2015; Merrie et
al., 2018).
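The quadrant reading of the Intertopic Distance Map amounts to classifying each topic by the signs of its two principal-component coordinates. A plain-Python sketch follows; the coordinates and topic numbers are invented for illustration, and the real values come from the pyLDAvis projection:

```python
def quadrant(x, y):
    """Map a topic's (PC1, PC2) coordinates to one of the four quadrants
    formed by the two axes of the Intertopic Distance Map."""
    horizontal = "right" if x >= 0 else "left"
    vertical = "upper" if y >= 0 else "lower"
    return f"{vertical}-{horizontal}"

# Hypothetical topic coordinates; in practice these are read from the
# pyLDAvis principal component projection.
topic_coords = {1: (0.12, 0.30), 2: (-0.25, 0.10), 3: (-0.05, -0.20)}
placement = {t: quadrant(x, y) for t, (x, y) in topic_coords.items()}
print(placement)
# → {1: 'upper-right', 2: 'upper-left', 3: 'lower-left'}
```

Labeling the two axes with exogenous drivers (Section 3.3) then gives each quadrant, and hence each topic, its scenario context.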
2.5 Employ structured futuring methods to take the LDA to a story-based
scenario
We develop a process for creatively blending the topic’s keywords and
the topic’s context, to construct a novel scenario world (i.e., the
setting of the story), produce characters who inhabit this scenario
world, and develop a brief plot. The first step is the same for all
scenarios:
Define the axes of the Intertopic distance map: Label the axes
of the principal component quadrants to define overarching context for
each scenario, to ensure that themes which are close to one another in
the principal component quadrant space are similar in some way, while
those far apart are dissimilar. Build from existing work that employs
similar scenario quadrants (Raven, 2014; Raven and Elahi, 2015; Merrie
et al., 2018).
The subsequent steps are repeated for each scenario, though the details
of each diverge according to the topics and keywords identified for each
scenario (a full example of these steps is given in Section 3.5).
Relevant references corresponding to each step are provided below:
- Summarize keywords : Examine the set of 30 keywords for the
topic, and manually summarize it into a core topic. If specific
location(s) appear, use them to provide a setting for the world.
- Distill core topic : Based on the keyword summary, identify a
suitable core topic (Kwon et al., 2017).
- Explore topic and keywords with futures wheels : Based on the
core topic, the keywords, and the intertopic context, brainstorm how
the ideas might be connected to one another in the future. Look for
both logical and contradictory connections (Pereira et al., 2018).
- Use 3-horizons framework to build a future history : Placing the
futures wheels brainstorm at the end of the third horizon, begin to
identify how the world has transformed from the present day to the
hypothetical future world. Identify key events or changes that had to
unfold to get from the present to the future (Sharpe et al., 2016).
- Probe reality and cultural change : Zoom-out from the specific
scenario world that is emerging, and explore what changes exist in
governance, education, culture, the arts, economy, and more (Hamann et
al., 2020).
- Push toward ridiculousness : Select several of the keywords or
other ideas and identify the most radical technological or social
changes that could unfold in the future. Include some of these in the
scenario (Dator, 1993; Merrie et al., 2018).
- Visualize character(s) : Take the nascent future world and
visualize a scene from the world. Explore the type of character that
is revealed in this scene and articulate what the character is doing
in the visualization. Based on this, define relevant attributes for
understanding this character (internal and external motivations; fears
and hopes; past experiences; etc.).
- Design plot based on world and character : Based on the
character and the world, identify a challenge that could emerge that
would allow the character to change or adapt in some way. Then,
identify how a character might deal with such a challenge (Johnson,
2011).
- Build story beats : Use the character and the basic plot to
articulate the story-beats that will form the scaffold of the story.
Story beats include: Every day…, Until one day…,
Because of this…, Because of that…, Until
finally…, and Ever since then…
- Write story : Using the story-beat scaffold, begin writing the
creative story of the character moving through the world, responding
to a challenge, and navigating the consequences of these actions.
- Test story for fidelity : Ensure that the resultant story
contains key elements from the LDA analysis, including the intertopic
context, the topic keywords, and the core topic.
In an effort to critically participate in the representation of diverse
situations in the Arctic, we intentionally and explicitly represent
storylines that take place in different social, cultural, and national
contexts. We also intentionally represent characters of different ages,
classes, and genders to provide a critical lens through which to view
issues of social power. By stating these goals openly, we hope to clarify the
importance of situating stories thoughtfully, particularly in parts of
the world populated by people who have been historically marginalized or
harmed.
3 Results
3.1 Corpus construction
We collected 2,058 articles from our set of Arctic news sources. Each
source provided 300 news articles, except for Arctic Today, for which
our search found a total of 258 articles. More of the returned articles
were published toward the end of the decade than toward the beginning,
which is possibly a result of the Google Search algorithm (Fig S11).
Each article was saved in a plaintext format, and its
metadata was recorded, which is available in Supplemental Table 1. The
2,058 articles were then batch-converted from plaintext files to
machine-readable strings. Other methods exist for characterizing corpus
composition, including bi- and tri-gram frequency (the frequency of
certain sequences of word pairs, or triads). In our work, however, we
employ the term frequencies that are specifically related to each of the
LDA-derived topics.
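For reference, the bi- and tri-gram frequencies mentioned above can be counted with the standard library alone; the sample token list is invented for illustration:

```python
from collections import Counter

def ngram_counts(tokens, n=2):
    """Count n-grams (adjacent word sequences of length n) in a token list."""
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))

# Invented sample tokens; bi-grams are n=2, tri-grams are n=3.
tokens = ["sea", "ice", "melt", "sea", "ice", "loss"]
print(ngram_counts(tokens, n=2).most_common(1))
# → [(('sea', 'ice'), 2)]
```

Our analysis instead relies on per-topic term frequencies from the LDA, but the same counting idea underlies both characterizations.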
3.2 Computational topic modeling
We iteratively performed the LDA by varying the number of topics, and
measuring the corresponding Coherence score (discussed in Section 2.3;
see Fig S12). The highest Coherence score in our analysis was 0.54, and
was achieved with eleven clusters. It is worth noting that Coherence is
a relative metric related to the corpus itself. Additional information
on the statistical optimization of latent Dirichlet allocation (LDA)
methods is discussed in depth in other work (Chang et al., 2009; Hecking
& Leydesdorff, 2018; Röder et al. 2015). Thus, the result of our LDA
analysis was eleven semantically different topic clusters.
Topic coherence is one way to quantitatively assess how well the topic
model is able to capture distinct topics with distinct sets of meaning,
i.e., semantic similarity. Given that the purpose of the LDA in this
research is to feed directly into a semantically meaningful task, i.e.,
creative storytelling, the qualitative process we employ serves as a
second, though informal, metric of whether the identified topics are
semantically meaningful. Indeed, topic eleven was ignored in this
analysis given its clear lack of meaning, despite being identified by
the LDA as a distinct topic. Other methods exist for detecting
meaningfulness in a corpus, including word or topic intrusion (Chang et
al., 2009). These are beyond the scope of this research, since the
authors themselves served as the human test of whether a topic was
semantically meaningful; however, such methods could be a useful
complement in research with no subsequent step that assesses topic
meaning.
The LDA produced a variety of results, including a set of overall term
frequencies across the entire corpus (Fig 2, right side), a set of
latent topics composed of keywords, as well as various measures of
‘intertopic’ distance in a set of principal component axes (Fig 2, left
side). The spread of the topics is not uniform across the distance map,
which highlights that some of the topics may be more related to one
another than not. This is not a problem and will be leveraged, as
discussed in the next section. The Intertopic Distance Map shows that
the first ten topics represent substantial portions of the corpus, while
the eleventh topic, while quantitatively unique, contains text that is
two orders of magnitude lower in frequency across the corpus. Thus, we
ignored the eleventh topic and were left with ten distinct topics.