Figure 1. Conceptual overview of how the topic modeling analysis feeds into the structured futuring process, including worldbuilding and story creation.
2 Data and Methods
2.1 News article corpus on the future of the Arctic
We collected news articles from multiple Arctic regional news sources. These articles were available in publicly accessible, English-language Arctic newspapers, specifically: The Arctic Sounder, Arctic Today, The Barents Observer, CBC North, The Moscow Times, Nunatsiaq News, and Radio Canada International. These sources were selected based on a set of preliminary conversations with community leaders, scientific experts, and public officials from throughout the Arctic. We used the Google Search engine for discovery, with a temporal search window of 2010–2020. These dates do not correspond to any specific event, but rather capture a recent, contemporary set of published perspectives on the future Arctic.
For most sources, we simply used the search term ‘future’ to filter the articles, given that the publication itself was an ‘arctic’ publication. For Radio Canada International and The Moscow Times, we used both ‘arctic’ and ‘future’ as filters. While the sources were restricted to English-language texts, they span the entire pan-Arctic region, including Russia, Finland, Sweden, Norway, Iceland, Canada, Greenland, and Alaska. The purpose of this broad collection is to ensure that the information, from the regional (i.e., spatially extensive, less granular, more general) to the local (i.e., spatially specific, more detailed, deeper knowledge), spans the possibility space of a large fraction of the Arctic discourse about the future available in English-language newspapers. While additional words could have been used, such as ‘projection’, ‘forecast’, or ‘scenario’, we were intentional about using a simple, straightforward — and hopefully repeatable — procedure. We rely on the word ‘future’ as our primary search term for multiple reasons. First, ‘future’ is commonly used as both a noun and an adjective. As a noun, ‘future’ is defined as “the time that will come after the present or the events that will happen then” (OED Online, 2021). Likewise, as an adjective, ‘future’ is defined as “that is to be, or will be, hereafter”. These definitions are precisely the meaning we are after. Second, there are very few nouns in the English language that are commonly used to convey this definition, with less commonly used words including ‘hereafter’ and ‘tomorrow’. There are some synonyms of ‘future’ as an adjective, though these tend to be less precise (words include: anticipated, expected). Thus, the simple choice of ‘future’ allows us to target the definition of the word with a straightforward approach.
While the Google Search algorithm often reports tens of thousands of search returns, a user only has access to approximately 300 entries. Thus, we limited the collection to the top 300 articles returned from each source. The ordering of entries in a Google search is determined by the proprietary PageRank algorithm, which ranks webpages based on the number of other pages on the internet that link to them. This will inevitably skew the types of articles listed at the top, and we anticipate that future work could use alternative search approaches that would potentially yield different sets of articles. That is, however, beyond the scope of this work. Ultimately, by using Google Search we ensure the method is free and generally easy for others to use, without any fee-based licenses.
The potential for biases exists in multiple aspects of this corpus. First, we only use English-language texts (or texts that were translated into English). As a result, there is an anglophone bias that may include implicit or tacit perspectives that are difficult to surface, not least related to the legacy of settler colonialism in Indigenous communities in the Arctic. Along with this, we note that the LDA implicitly absorbs the biases of the corpus itself, including: the vocabulary of the news article authors, the political biases of the article authors themselves, and the editorial biases of the news sources and their publishers. In acknowledging the various contexts that constrain the creation of these source articles (D’Ignazio and Klein, 2020), we aim to underline clearly that this machine learning method contains its own varieties of bias.
Second, we implicitly adopt the biases of the search algorithm that was used — in this case, a Google-based search. While this is certainly a bias, we intend for this method to be theoretically accessible to anyone, which excludes many scholarly search products that often require paid subscriptions or memberships in specialized organizations. While other forms of written information about the future of the Arctic region exist (such as journal articles, reports, or other documents), we wanted to collect publicly available and currently discussed ideas about the future. Freely available news articles serve this aim.
Finally, we note that the research team is neither local nor Indigenous to the Arctic, and is only capable of interpreting English-language texts. Moreover, since the goal was to generate written stories from computational text analysis, both the input and the output needed to be in English. Also, while a future goal of this type of work could include engagement of local and Indigenous communities in the Arctic, this work aims to demonstrate the method’s potential by using publicly available documents written for and by an audience that exists in, or is concerned with, the Arctic. That being said, the data should be interpreted as originating from texts that were written from specific positions of cultural power throughout the Arctic and near-Arctic, rather than as representative of a reality devoid of these power dynamics (D’Ignazio and Klein, 2020).
2.2 Text preparation and conversion
Each document was saved in a plaintext file, and manually stripped of extraneous material that did not pertain to the article itself, including strings of characters associated with unrelated website HTML (i.e., Hypertext Markup Language), unrelated publication text, or advertisements. Once the corpus of texts was identified, we generated a machine-readable corpus using Python-based scripts that batch-convert documents to text strings. It is important to note that the corpus is used for educational and research purposes only, and that the corpus itself is not publicly distributed.
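The batch-conversion step described above can be sketched in a few lines of standard-library Python. This is a minimal, illustrative stand-in, not the scripts used in this study; the directory layout, file naming, and function name are assumptions.

```python
# Minimal sketch of batch-converting plaintext articles into in-memory
# strings. The helper name and directory layout are illustrative only.
from pathlib import Path
import tempfile

def load_corpus(corpus_dir):
    """Read every .txt file in corpus_dir into a list of text strings."""
    docs = []
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        docs.append(path.read_text(encoding="utf-8"))
    return docs

# Small self-contained demo using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "article_001.txt").write_text("The future of the Arctic.", encoding="utf-8")
    Path(tmp, "article_002.txt").write_text("Sea ice and shipping routes.", encoding="utf-8")
    corpus = load_corpus(tmp)
```

Reading each file into a single string keeps the downstream tokenization step independent of how the articles were originally stored.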
2.3 Latent Dirichlet allocation (LDA)
Using the Gensim package, in addition to several other Python-based tools, we converted the strings of text into a vectorized set of inputs for analysis, including tokenization, lemmatization, and stop-word filtering (Řehůřek and Sojka, 2010; Sarkar, 2019). Next, we performed the latent Dirichlet allocation (LDA). LDA is a machine-learning-based approach that takes a large corpus of texts and reveals the latent (i.e., hidden) patterns of keywords and topics that occur across the corpus. Below is a more detailed explanation of the process, including the corresponding versions of each software package used. We note that multiple methods of text analysis could be suitable for identifying semantically distinct topics from a large corpus, such as latent semantic analysis (LSA) or latent semantic indexing (LSI). We employ LDA primarily because it produces highly interpretable topics (Kayser and Shala, 2014) with intuitive topic visualization options (Sievert and Shirley, 2014).
We use Python version 3.7.7 for this entire analysis. The initial step for the LDA is to pre-process each document using the Python-based Natural Language Toolkit (NLTK), version 3.4.4. Tokenization breaks the text within each document in the corpus into individual units of meaning, in this case, individual words. Stopwords are then removed, including frequently used words such as “the”, “and”, “as”, etc. We note that we used both Gensim and NLTK because the tokenization process in Gensim allowed spurious words to persistently pass through the filters we implemented; NLTK proved more effective for our specific tokenization task.
Lemmatization is the final step in the corpus preparation; it reduces the remaining words to their base form, e.g., mapping past-tense versions of a word to a common form. These steps result in a tokenized corpus of texts. It is possible that lemmatization eliminates potentially valuable temporal context for articles, specifically signals of past, present, and future tense. However, in this analysis we wanted to maximize the set of distinct words used to characterize the future, rather than retain (potentially) multiple words with the same root, albeit in different tenses. Future work could explore this question, though it is outside the scope of this work.
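The preprocessing pipeline (tokenization, stop-word filtering, lemmatization) can be illustrated with a simplified, dependency-free stand-in. In the actual analysis these steps were performed with NLTK and Gensim; the tiny stop-word set and lemma table below are illustrative assumptions, not the resources used in the study.

```python
# Illustrative stand-in for the NLTK-based preprocessing pipeline.
# STOPWORDS and LEMMAS are toy tables, not the full resources used.
import re

STOPWORDS = {"the", "and", "as", "of", "a", "in", "is"}            # toy subset
LEMMAS = {"melted": "melt", "melting": "melt", "routes": "route"}  # toy lemma map

def preprocess(text):
    """Tokenize, drop stopwords, and map words to a base form."""
    tokens = re.findall(r"[a-z]+", text.lower())        # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]  # stop-word filtering
    return [LEMMAS.get(t, t) for t in tokens]           # (toy) lemmatization

doc = "The sea ice melted, and the shipping routes opened."
tokens = preprocess(doc)
```

The output is a list of normalized tokens, the form the LDA step consumes; in the study, NLTK's tokenizer and lemmatizer replace the regular expression and lookup table shown here.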
Next, the Gensim Python package (version 3.8.0) is used for the LDA, a method that iteratively identifies the latent topic structure across the corpus. This is accomplished by repeatedly evaluating sets of words and learning which clusterings lead to coherent, distinct topics. Several parameters can be adjusted, but the most consequential for our work is the number of topics being sought. We performed a sensitivity analysis, varying the number of topics into which the corpus was clustered, and calculated the resulting Coherence scores for all analyzed numbers of clusters. Coherence measures the degree of semantic similarity among the terms in each topic cluster, which helps distinguish between statistical artefacts and actual semantic relatedness; a higher Coherence score implies greater semantic similarity. There are several metrics for Coherence, and based on a systematic review of various coherence metrics, we employ the C_v metric (Röder et al., 2015). For our purposes, we aimed for each topic to be internally, semantically similar enough to support a coherent storytelling process. Thus, we aimed for the highest Coherence score.
2.4 Visualize LDA results and identify scenario seeds
Using the pyLDAvis package with Gensim (Sievert and Shirley, 2014), we visualized the Intertopic Distance Map, based on a principal component analysis calculated within pyLDAvis. Additionally, we show the 30 most relevant terms for each topic. The Intertopic Distance Map was leveraged as a quadrant space (i.e., the four sections created by the two axes), and we labeled the axes to provide scenario context based on exogenous drivers; this is further elaborated in Section 3.3 below. The process of creating a quadrant space to construct scenarios is a central feature of many scenario analyses (Raven, 2014; Raven and Elahi, 2015; Merrie et al., 2018).
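The two-dimensional layout underlying an intertopic distance map can be illustrated with a minimal principal-component projection of a topic-term matrix. The matrix below is a toy stand-in, and pyLDAvis performs its own, more elaborate computation internally; this sketch only shows the core idea of projecting topics into two axes.

```python
# Minimal sketch: project topics onto two principal components so each
# topic gets an (x, y) position, as in an intertopic distance map.
# The topic-term probability matrix below is a toy stand-in.
import numpy as np

# Toy topic-term matrix: 4 topics over a 6-term vocabulary (rows sum to 1).
topic_term = np.array([
    [0.40, 0.30, 0.10, 0.10, 0.05, 0.05],
    [0.05, 0.10, 0.40, 0.30, 0.10, 0.05],
    [0.10, 0.05, 0.05, 0.10, 0.40, 0.30],
    [0.35, 0.35, 0.10, 0.05, 0.10, 0.05],
])

# Center the topics and project onto the first two principal components
# via SVD; nearby coordinates indicate topics with similar term usage.
centered = topic_term - topic_term.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords = U[:, :2] * S[:2]
```

In the quadrant-space reading used here, the signs of the two coordinate axes carve the plane into the four scenario contexts.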
2.5 Employ structured futuring methods to take the LDA to a story-based scenario
We develop a process for creatively blending the topic’s keywords and the topic’s context, to construct a novel scenario world (i.e., the setting of the story), produce characters who inhabit this scenario world, and develop a brief plot. The first step is the same for all scenarios:
Define the axes of the Intertopic Distance Map: Label the axes of the principal component quadrants to define the overarching context for each scenario, ensuring that topics which are close to one another in the principal component quadrant space are similar in some way, while those far apart are dissimilar. This builds from existing work that employs similar scenario quadrants (Raven, 2014; Raven and Elahi, 2015; Merrie et al., 2018).
The subsequent steps are repeated for each scenario, though the details of each diverge according to the topics and keywords identified for each scenario (a full example of these steps is given in Section 3.5). Relevant references corresponding to each step are provided below:
  1. Summarize keywords: Examine the set of 30 keywords for the topic, and manually summarize them into a core topic. If a specific location (or locations) appears, use it to provide a setting for the world.
  2. Distill core topic: Based on the keyword summary, identify a suitable core topic (Kwon et al., 2017).
  3. Explore topic and keywords with futures wheels: Based on the core topic, the keywords, and the intertopic context, brainstorm how the ideas might be connected to one another in the future. Look for both logical and contradictory connections (Pereira et al., 2018).
  4. Use 3-horizons framework to build a future history: Placing the futures-wheel brainstorm at the end of the third horizon, begin to identify how the world has transformed from the present day into the hypothetical future world. Identify key events or changes that had to unfold to get from the present to the future (Sharpe et al., 2016).
  5. Probe reality and cultural change: Zoom out from the specific scenario world that is emerging, and explore what changes exist in governance, education, culture, the arts, the economy, and more (Hamann et al., 2020).
  6. Push toward ridiculousness: Select several of the keywords or other ideas and identify the most radical technological or social changes that could unfold in the future. Include some of these in the scenario (Dator, 1993; Merrie et al., 2018).
  7. Visualize character(s): Take the nascent future world and visualize a scene from it. Explore the type of character that is revealed in this scene and articulate what the character is doing in the visualization. Based on this, define relevant attributes for understanding this character (internal and external motivations; fears and hopes; past experiences; etc.).
  8. Design plot based on world and character: Based on the character and the world, identify a challenge that could emerge that would allow the character to change or adapt in some way. Then, identify how the character might deal with such a challenge (Johnson, 2011).
  9. Build story beats: Use the character and the basic plot to articulate the story beats that will form the scaffold of the story. Story beats include: Every day…, Until one day…, Because of this…, Because of that…, Until finally…, and Ever since then….
  10. Write story: Using the story-beat scaffold, begin writing the creative story of the character moving through the world, responding to a challenge, and navigating the consequences of these actions.
  11. Test story for fidelity: Ensure that the resultant story contains key elements from the LDA analysis, including the intertopic context, the topic keywords, and the core topic.
In an effort to critically participate in the representation of diverse situations in the Arctic, we intentionally and explicitly represent storylines that take place in different social, cultural, and national contexts. We also intentionally represent characters of different ages, classes, and genders to provide a critical lens through which to view issues of social power. By stating these goals explicitly, we hope to clarify the importance of situating stories thoughtfully, particularly in parts of the world populated by people who have been historically marginalized or harmed.
3 Results
3.1 Corpus construction
We collected 2,058 articles from our set of Arctic news sources. Each source provided 300 news articles, except for Arctic Today, for which our search returned a total of 258 articles. More of the returned articles were published toward the end of the decade than toward the beginning, which is possibly a result of the Google Search algorithm (Fig S11). Each article was saved in a plaintext format, and its metadata was recorded; the metadata are available in Supplemental Table 1. The 2,058 articles were then batch-converted from plaintext files to machine-readable strings. Other methods exist for characterizing corpus composition, including bi- and tri-gram frequency (the frequency of certain sequences of word pairs, or triads). In our work, however, we employ the term frequencies that are specifically related to each of the LDA-derived topics.
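The bi-gram frequency alternative mentioned above can be computed with a few lines of standard-library Python; the token sequence below is an illustrative stand-in, not data from the corpus.

```python
# Sketch of bi-gram frequency as a corpus-characterization measure:
# count adjacent word pairs in a token sequence. Tokens are illustrative.
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs (bi-grams) in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))

tokens = ["sea", "ice", "sea", "ice", "shipping", "route"]
counts = bigram_counts(tokens)
```

Tri-grams follow the same pattern with `zip(tokens, tokens[1:], tokens[2:])`; both measures summarize a corpus without requiring a fitted topic model.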
3.2 Computational topic modeling
We iteratively performed the LDA, varying the number of topics and measuring the corresponding Coherence score (discussed in Section 2.3; see Fig S12). The highest Coherence score in our analysis was 0.54, achieved with eleven clusters. It is worth noting that Coherence is a relative metric, specific to the corpus itself. Additional information on the statistical optimization of latent Dirichlet allocation (LDA) methods is discussed in depth in other work (Chang et al., 2009; Hecking and Leydesdorff, 2018; Röder et al., 2015). Thus, the result of our LDA analysis was eleven semantically distinct topic clusters.
Topic coherence is one way to quantitatively assess how well the topic model captures distinct topics with distinct sets of meaning, i.e., semantic similarity. Given that the purpose of the LDA in this research is to feed directly into a semantically meaningful task, i.e., creative storytelling, the qualitative process we employ serves as a second, though informal, metric of whether the identified topics are semantically meaningful. Indeed, topic eleven was ignored in this analysis given its clear lack of meaning, despite being identified by the LDA as a distinct topic. Other methods exist for detecting meaningfulness in a corpus, including word or topic intrusion (Chang et al., 2009). These methods are beyond the scope of this research, since the authors themselves served as the human test of whether a topic was semantically meaningful; however, they could be a useful complement in research where there is no subsequent step that assesses topic meaning.
The LDA produced a variety of results, including a set of overall term frequencies across the entire corpus (Fig 2, right side), a set of latent topics composed of keywords, and various measures of ‘intertopic’ distance in a set of principal component axes (Fig 2, left side). The spread of the topics is not uniform across the distance map, which highlights that some topics may be more related to one another than others. This is not a problem, and is leveraged as discussed in the next section. The Intertopic Distance Map shows that the first ten topics represent substantial portions of the corpus, while the eleventh topic (though quantitatively unique) contains text that is two orders of magnitude lower in frequency across the corpus. Thus, we ignored the eleventh topic, and are left with ten distinct topics.