# No Title Found

AbstractNo Abstract Found

#1

\nofront\nolistoftables\nolistoffigures\msc\twosupervisors\copyrightyear

2015 \submitdate31 July 2015

The Emergence of Memes in Social Networks

\supervisor

Paul Kennedy and Thomas Osborn \examinerChairperson of Research Degrees Committee\firstreaderThomas Osborn

\beforepreface
###### Contents:
• 1 Introduction
\nonumchapter

Abstract In recent years, research into social networks has rapidly progressed with the application of methods from the study of complex systems. Widespread user adoption of online social networks has created rich data sets regarding the structure and dynamics of social networks. The emergence of memes and viral content is frequently observed in online social media. However, research into complex, global and emergent phenomena is still an open area. The thesis of this dissertation is that the emergence of memes can be simulated in online social networks (OSN’s) by utilising multi-agent systems constructed to represent the network. This thesis is developed by first delivering novel methods for link prediction within online social networks. We show that link prediction can be used to classify the properties of a network, and also define the rule set for a multi-agent representation of the network. Using real world online social network data from Twitter and Facebook, we develop a prediction model for heterogeneous OSN users to engage with heterogeneous content. This model is then used to derive the rule set for a multi-agent system representing a real world OSN. We demonstrate how simulations of viral content and memes compare to real world cases. The results of this dissertation provide a better understanding of the complex dynamics influencing user activity within OSN’s. This research can better inform social marketing and advertising, deliver more effective recommendation systems in online social networks, and provide a basis for sociological theories with greater predictive power. \afterpreface

## Introduction

\label{cha:intro}

This thesis is focused on predicting the emergence of memes and viral phenomena within online social networks. Network effects play a strong role in the sudden and extensive propogation of viral content, so this aim is achieved by incorporating representative information regarding the network topology. To account for the diverse range of users of online social networks, as well as content and platforms, we adopt a multi-agent system approach.

There is a wealth of research already existing regarding social network analysis, which we review and consider in developing our method. The topology of social networks, in particular, represents a plethora of measures and techniques to classify and quantify structural properties. The diffusion dynamics across social networks has also received a great deal of attention, with many methodologies proposed. Finally, viral diffusion and the emergence of memes has attracted recent attention in the literature. However, there is still a research gap around the prediction of content and ideas diffusing in a critical, viral fashion, across a social network.

## Social Network Models

The study of social networks stretches back into the sociological tradition of the early $$20^{th}$$ century. Much of this early work was focused around very small networks, and was merely descriptive in nature. Many sociological researchers focused their efforts on simple network visualisations and analysis of key influencers, and did not venture far into the mathematical study of networks. However, they offered insights into how the shape or structure of a social network can affect the exchange of information.

A social network can be described as mathematical graph consisting of a set of nodes connected by a set of edges. Many of the developments in the modern, data driven study of social networks have progressed in line with developments in graph theory. The origins of graph theory date back to Euler in the $$18^{th}$$ century, who’s solution to the Konigsberg bridge problem is considered to be the first theorem in graph theory. However, it wasn’t until Erdős et al. (1957) introduced the mathematics of random graphs that the field started to be applied to social networks. In the random graph model, nodes are connected with uniform probability to other nodes. As the connection probability increases, a connectivity phase-transition is observed. This model introduced the application of mathematics to the study of social networks, and allowed for analytical solutions to questions around the connection structure of the network to be answered.

While the random graph model allows for mathematical theory to be derived, it does not explain many properties observed in real world social networks. Perhaps the most famous such property is Stanley Milgram’s ‘six degrees of separation’, also known as the ‘small-world’ effect (Travers et al., 1967). The concept is that real-world social networks have a surprisingly short diameter, i.e. you can reach any other person in the world in a small number of steps. Further to the small world property, in many social networks the distribution for the number of connections per node follows a power law, rather than the uniform distribution assumed by random graph theory. This latter property is also observed in certain physical systems. Physicists addressed these properties by developing what is now known as the field of complex systems.

The similarities between complex physical systems and social networks sparked the interest of a group of physicists who had previously worked on complex systems. Firstly, Watts et al. (1998) published their small-world model which produces the observed small network diameters. Soon after, Barabasi et al. (1999) addressed the common power-law degree distribution property by introducing their scale-free network model. Both of these models can be considered to be multi-agent representations of social networks, with simple rule sets. Following from these models, a wide variety of research has been published regarding the properties and applications of social networks. However, there is still an open research gap regarding the set of rules governing these models.

## Network Topology

The study of network topology deals with aspects of the network connection structure, and has seen a great deal of interest since the work of Watts et al. (1998) and Barabasi et al. (1999). The topological analysis of social networks has formed around the broad categories of macro network properties, node and edge influence, and complex community structure.

Firstly, there are a variety of properties that can be measured and classified for a whole network. These include measures such as a network’s diamater, density, clustering, and degree distribution. Analysis of these properties is important with respect to creating realistic models of social networks, and certainly directed the development of the small-world and scale-free models.

Much of the early research in social networks focused around understanding the influence of nodes, and particularly nodes in positions of information brokerage. With the tools of graph theory, structural influence can be defined using a variety of measures known as centrality measures. Centrality measures attempt to quantify the relative influence of a node over a network. One of the simplest of these is the degree of a node, defined as the number of edges that a node has. There are many common centrality measures used in network science, and they carry useful information regarding network structure.

Soon after the small-world and scale-free models were published, a multi-disciplinary group of researchers started to approach social networks with the aid of computational algorithms. One of the first applications of this approach was to partition a network into a set of densely connected subsets. This is known as the problem of community detection, and a large body of research has been published to address it. Some of the most common algorithms are based on centrality measures, and seek to either divide communities at edges with low influence, or center communities at nodes with high influence. For example, Girvan et al. (2002) proposed an algorithm that defines communities by iteratively removing nodes with the lowest betweenness centrality scores. Further to the development of community detection algorithms, the analysis of hierarchical community structure inherent within many social networks has generated a lot of interest. An example of this is the classification of scaling laws at different levels of the community structure, and the development of theory as to why this occurs.

Understanding and desribing the full topology of a network is still a challenging and unresolved task, due to the large number of dimensions and inter-connectivity of networks. Methods to describe and express the information contained in a network’s topology are therefore critical to modelling processes taking place.

## Network Diffusion

The concept of diffusion on a social network is important due to its many practical applications. Network diffusion models describe processes taking place on the network, such as the flow of information or spread of an infectious disease. Some well established methods for modelling diffusion involve dynamical systems models, which describe the interaction between groups within a population. Examples include the Susceptible-Infected-Susceptible (SIS) model for the spread of infectious diseases, and the Bass diffusion model of innovation. These dynamical systems techniques work well in cases with limited data as they do not require explicit knowledge of the network topology.

The other key approach is to model each individual node as an agent which acts subject to pre-defined rules. This multi-agent system approach generally requires more information about the underlying connection structure, but can model complex and interdependent behaviour. Similarly to multi-agent models of evolving social networks, such as the small-world and scale-free models, defining the rule sets is still an open research area. Interesting global behaviours emerge out of the multi-agent approach, such as information rapidly diffusing through large parts of the network at critical parameter values and conditions. This phenomena is generally well known by users of OSNs as the viral spread of content and memes. Research into critical diffusion within a social network is an active area of study given the recent availability of OSN data.

### Viral Diffusion and Memes

An understanding of the diffusion dynamics of a social network is particularly pertinent in the modern age of online social media. We are constantly immersed in content distributed through our online networks. At some level, we would all have experienced the phenomena of viral content. The related concept of a meme, and particularly their self-propagation, can be traced back to Dawkins (1976). Dawkins defined a meme as a self-replicating unit of cultural information or thought, passed from person to person by means of imitation. In a social network, this can be as simple as sharing or retweeting a piece of content, but can also involve more complex ideas communicated through natural language.

Recently, the question of whether we can predict the emergence of memes in a social network has received a good deal of interest. One may consider properties of the content itself, and the context within which it arises, as key predictors of it’s viral diffusion. Also, how the content diffuses through a network has been shown to be crucial. The structure of the network itself can therefore be critical in determining the velocity and extent of diffusion. For example, Weng et al. (2014) proposes using a supervised learning approach to predict which memes spread extensively with features derived from the network and community structure.

Viral diffusion and the emergence of memes have been approached from both a dynamical systems perspective, where criteria for viral spreading are derived, and also from a multi-agent representation. In this latter approach, we attempt to set the agent rules in line with empirically observed behaviour in social networks. Simulation is used to determine results for diffusion velocity and extent. However, with both approaches to modelling the emergence of memes, realistic network topologies must be assumed, as well as initial conditions for diffusion.

In this dissertation, we adopt a multi-agent approach to modelling the emergence of memes, taking into account realistic network topologies. To achieve this, we empirically derive the agent rule set from real world online social network data, incorporating both heterogeneous content and users. We validate our model by comparing the simulated viral diffusion dynamics and meme emergence with cases from real world data.

## Research Aims

\label{cha:ResearchAims}

## Research Questions

The thesis outlined in this dissertation is that we can simulate the emergence of memes within a social network. We arrive at this by first developing a framework to understand the relationships between network topology, growth and diffusion. We then develop a set of methods to infer additional information about both nodes and edges with data from real world social networks Facebook and Twitter. This leads to a definition of OSN users heterogeneous across content preferences and network position. We adopt a multi-agent system approach to simulating real world social network activity using empirically derived agent rule sets. Finally, we provide analysis of our simulated meme emergence, and compare to real world activity.

\parOur

research questions are as follows:

• RQ 1

How does the topology of a social network influence future link formation and content diffusion?

• RQ 2

What factors are predictive of heterogeneous online social network users engaging with different types of content?

• RQ 3

To what extent can we simulate the emergence of memes in a social network?

## Stakeholders

This research is of interest to a wide range of stakeholders due to the large number of users of OSNs and the multi-disciplinary nature of complex networks research. A better understanding of the interplay between network topology, link formation and content diffusion has a wide variety of applications. For instance, better link prediction models can improve the accuracy of connection recommendations in online social networks. More accurate content diffusion models, incorporating network topological effects, can benefit a large tranche of digital media and advertising companies as they can predict more accurately which content is more likely to go viral. Finally, an accurate methodology to predict the emergence of memes in a social network can benefit people studying human social behaviour, as well as online social network users, platform providers, and digital marketers and advertisers. \parThekey stakeholders of this dissertation belong to the following groups:

• Digital Marketers and Advertisers

• Online Social Network Providers

• Online Social Network Users

• Social Network Researchers

• Sociologists and Anthropologists

## Aims

The key aim of this research is to show how we can simulate the extent with which different types of memes will diffuse in different networks. Throughout this thesis, we show how this can aim can be achieved through constructing a multi-agent system which represents both the topology and activity dynamics of a real world social network. The aims of this thesis are as follows:

• Aim 1

Show how the topology of a social network affects future link formation and content diffusion.

• Aim 2

Show how we can model heterogeneous online social network users engaging with different types of content.

• Aim 3

Show how the viral diffusion of memes can be simulated in a social network.

## Objectives

The key objective of this research is to deliver a computational framework that can simulate the emergence of memes in a social network. This objective is achieved by first delivering modelling approaches to predict future link formation and content diffusion using measures derived from the network topology. We then deliver a statistical framework to model the likelihood of different types of OSN users engaging with heterogeneous content. Finally, we deliver a modelling approach to predict the virality of heterogeneous content in a social network, using a multi-agent system incorporating results from the first two objectives. The objectives of this thesis are as follows:

• Obj 1

Deliver modelling approaches to predict future link formation and content diffusion using features derived from the network topology.

• Obj 2

Deliver a statistical framework to predict which users will engage with different content, using a variety of graph topological measures and other features.

• Obj 3

Deliver a multi-agent system to simulate and predict the emergence of memes in a social network.

## Contributions to Knowledge

This dissertation offers several contributions to knowledge. Firstly, we explore the influence of network topology on the formation of future links and simple content diffusion. We extend this method by incorporating link strength, sign and further natures. Next, we propose a novel social network model based on the topological influences in real world networks, and provide an extensive review of the literature in this area.

Working with natural language captured from Facebook and Twitter activity, we then deliver a framework to infer key information regarding nodes and edges in the respective networks, and provide a model to predict which users engage with which topics and content. Finally, we use all of this research to construct a multi-agent system to simulate the diffusion of heterogeneous content across heterogeneous networks, and analyse the emergence of memes. \parTheknowledge contributions of this research are categorised as below:

• KC 1

A novel approach to link prediction in a social network using features derived from the topology of the network.

• KC 2

A novel approach to classify the properties of a network using link prediction importance measures.

• KC 3

A novel approach to extend the link prediction problem to incorporate link strength and sign, using real world data.

• KC 4

A novel multi-agent diffusion model for simple content sharing incorporating the influence of network topology, simulated on artificial and real world Twitter network data.

• KC 5

A review of the research regarding social network models, and a novel multi-agent model of a growing social network.

• KC 6

A novel modelling approach to estimate the likelihood for nodes in a social network to engage with heterogeneous content, using topological measures and inferred measures from real world Twitter data.

• KC 7

A novel multi-agent methodology to simulate and quantify the emergence of viral content, or memes, in a social network.

### Significance

Research into the inter-dependencies regarding social network structure and diffusion dynamics is important and of use to a wide range of stakeholders. Particularly, the emergence of memes is a topic important to many stakeholders, as the phenomena has the potential to engage large volumes of people. Social networks are deeply embedded in many aspects of modern society, from the interactions of agents in financial markets to the structure and dynamics of high school cliques, large organisations and even nations themselves. Some applications of this research include more effective targeting and content delivery for digital advertising, as well as a better user experience on online social media platforms. Sociologists and complex systems scientists may also apply this research to other data sets and networks. Further ahead, we believe that this research may be used to inform the design of organisational and social structures which optimise the flow of information, and may lead to more effective social policies at a government level.

The research questions and associated knowledge contributions benefit the stakeholders listed in many ways. The first research question asks how a network’s topology can be used to predict links and influence diffusion dynamics. We propose using a link prediction supervised learning model with a graph topological feature set. This has the advantaget that it can be applied to any network where the connection structure is known, so can be used as a benchmark in further research. This benefits complex and social network researchers by providing a comparision model. With regards to network diffusion, it is well known that a network’s structure can dramatically affect the diffusion dynamics (Newman, 2003). However, the exact nature of this dependency is still an open question. We aim to answer this question by quantitatively showing how the results of diffusion simulations using multi-agent systems vary with different network topologies. Answering this research question extends the existing literature on this topic, so social network researchers largely benefit.

The second question asks which factors are predictive of heterogeneous online social network users engaging with different types of content. This question is particularly significant to online social media platfoms, content creators, and digital advertisers. An answer to this question will provide these stakeholders with an increased understanding of the different types of OSN users, and how their preferences for media and content differ. This can be used to refine the targeting for advertisements, friend recommendations, and the type of content used in marketing material. The statistical framework proposed answers this question by learning a function which expresses the influence and importance of the factors on OSN users engaging content. This function can also be used to partition OSN users into heterogeneous groups. This will benefit complex system scientists, data scientists and any other researchers interested in apply the techniques to real world data. Also, sociologists and anthropologists will benefit from an increased understanting of the types of people using online social networks, and their content preferences.

The third question asks how accurately we can simulate the emergence of memes in a social network. We answer this question by modelling viral diffusion with a multi-agent system, where the agent rules are empirically derived from real world networks. Simulated results regarding viral diffusion are then compared to real world cases. This work benefits all of the stakeholders listed. Digital marketers and advertisers will be provided with an approach to test which types of content will go viral, controlling for how they are spread in a network. Online social network providers will benefit from a framework to compare activity taking place in their network with a simulation, and can use this to create more accurate content recommendation systems. Users of online social networks will benefit from having better content recommended to them. Social network researchers will have a novel and comprehensive study simulating real world activity in social networks, and sociologists and anthropologists benefit from an empirical study of the crowd behaviour using realistic agents representing individuals in OSNs.

## Literature Review

\label{cha:litreview}

The literature review presented here aims to provide a comprehensive survey of the key areas of research into social networks. Throughout this review, it is shown that the research question of how we can simulate or predict the emergence of memes in social networks is still open. Furthermore, there are many research gaps regarding the inter-dependencies between network topology and diffusion. Further research is also required to better understand network topology, particularly of large, real world online social networks.

The body of literature in this field is divided into four key areas of active research. Section \ref{lit_topology} delivers a review of measures and methods to classify the topological structure of a network, including techniques to identify influential nodes and approaches to detect community clustering structure. Some results on hierarchical community structure are also discussed. Section \ref{lit_SNM} provides a review of the literature concerning evolving models of network topology, and section \ref{lit_linkprediction} introduces the problem of link prediction and the main methods available. Aection \ref{lit_diffusion} presents two common approaches to modelling diffusion on social networks, followed by a review of the research regarding the prediction and simulation of memes.

## Network Topology

\label{lit_topology}

A network or graph, labelled by $$\mathcal{G}$$, is a set nodes (also known as vertices, agents, actors), $$\mathcal{V}$$ and a set of dyadic connections (edges, links, ties), $$\mathcal{E}$$ which join the nodes. The connections can be directed or undirected, and a variety of connection types can be chosen to give different graphs. In the language of graph theory, the connection structure between nodes of a graph can be represented by the adjacency matrix. For a finite graph $$\mathcal{G}$$ with $$n$$ nodes, the adjacency matrix is the $$n\times n$$ matrix with elements $$a_{ij}$$, which are non-zero if there is an edge between nodes $$i$$ and $$j$$. We then define properties of graphs such as vertex degree, or the number of edges for a given vertex, and the degree distribution of all vertices in the network. In a directed network, nodes can have both in-degree and out-degree in line with the edge direction. There are also many other properties of networks including distance (the number of edges traversed between nodes), network diameter (the maximum distance of the network), clustering and centrality measures, and other concepts such as paths. In particular, there are many definitions of centrality measures, such as betweenness, closeness and eigenvector centrality. Network models can be applied to a wide range of phenomena across multiple disciplines including physics, biology and computer science, however the focus for this review is on social networks.

The analysis of social networks began within the sociological discipline. Much of the work was descriptive and involved social science methodologies such as carefully designed behavioural surveys and the visual analysis of small network structures (Freeman, 2011). Most notable was the work of Travers et al. (1967), whose study produced empirical evidence of the ‘small-world’ effect: the popular concept of ‘six degrees of separation’, where the the network can be traversed in a surprisingly short number of steps. However, a new model of how social networks form was required to explain this result. Taking methods from the analysis of the physics of critical phenomena, Watts et al. (1998) proposed a network model where all nodes start with limited local edges to neighbouring nodes, then additional edges are assigned in a random fashion. The resultant network is highly clustered, meaning that nodes are densely connected together, and the network exhibits small characteristic path lengths; the ‘small-world’ effect.

In addition to the small-world effect, many real world social networks follow a power law degree distribution which the model of Watts et al. (1998) does not explain. Such a degree distribution is evident where the probability of a node having degree $$k$$ follows a power law, given by $$P(k)\sim k^{-\gamma}$$, where $$\gamma$$ is a constant. To address this, Barabasi et al. (1999) introduced the concept of preferential attachment, where nodes form edges preferentially to nodes that are already well connected, rather than randomly or locally to their neighbours. The resultant network, which they coined ‘scale-free’, follows a power law degree distribution and very closely matches the degree distribution of documents connected via hyperlinks on the World Wide Web. Although this model explains many real world networks, many others exhibit degree distributions other than power law and can be characterised by complex topological structure and community clustering (Newman, 2003).

Much of the more recent work relating to the analysis of network structure and topology can be categorised into two key areas, namely the idenfitication of influencial nodes according to a range of measures, and the detection of community structures within a network. Some interesting results have also come from analysing scaling properties of hierarchical networks.

### Influence and Centrality

\label{influencers}

A key area of research with regard to the topology of social networks is the identification of influencer nodes. Influencers in a social network can be defined as nodes that contribute significantly to the diffusion properties of the network, or are situated in positions of brokerage. For example, a node that has high degree is more likely to spread a virus, if infected, to a greater number of people than a node with low degree. This area is important to a range of applications that aim to better understand spreading behaviour, including innovation diffusion, viral marketing and epidemic control. In practice, commonly applied methods to identify influencers each convey distinct information for nodes in social networks. However, there is no consensus in the research community, so consideration must be given regarding which method to use (Pei et al., 2013). In particular, graph centrality measures provide a simple approach to score the influence of nodes over a network, and they form the basis for many methods. Furthermore, centrality measures have been shown to be useful as predictors in the link prediction problem due to the information they carry regarding the network topology and position of each node.

One of the earliest methods proposed for influencer identification is based on betweenness centrality, which measures the influence a node has over the spread of information in the network. It is calculated for a node as the fraction of shortest paths between other node pairs that pass through the chosen node. Newman (2005) proposed an influencer identification method that relaxes the shortest path restriction, instead using random walks to give how often the chosen node will fall on alternate paths between nodes. However, nodes that lie on shorter paths are effectively weighted up as there will be fewer alternative paths. Tested on a range of networks, this method generally gives more accurate solutions than shortest path betweenness measures, although the algorithm has a high computational cost. This cost is proportional to the number of nodes cubed, but can be overcome through utilising more optimal algorithms (Brandes, 2001; Brandes, 2008). As an alternative to the random walk calculation, Mantrach et al. (2010) propose a more sophisticated approach based on efficiently estimating a family of covariance measures between nodes on a directed, weighted graph. The key idea is that two nodes are correlated if they frequently both lie on the same trajectory. However, more work is required here to integrate these covariance measures into an efficient algorithm for influencer identification.

Another popular approach to identifying influencers is to use eigenvector centrality, which was originally applied in sociology as a measure of the influence of an individual in a social network (Pei et al., 2013). The eigenvector centrality for a node takes into account the centrality of its neighbours, as well as itself. This method is what Google’s well known search algorithm for the World Wide Web, PageRank, utilises to rank web pages according to the hyperlink network. However, Lu et al. (2011) apply PageRank to a dynamic online community in an attempt to identify community leaders and find it is not as accurate in a quickly evolving social network since the full model requires frequent recalibration. The authors develop their own eigenvector centrality method to identify influential nodes, LeaderRank, which self-adjusts to new information and outperforms PageRank in terms of accuracy. Another application of eigenvector centrality was by Weng et al. (2010) who develep and apply a method called TwitterRank to estimate the influence of users on Twitter. TwitterRank takes both the topic similiarity and connection structure betweeen users into account, and on a sample of data it outperformed the algorithm that Twitter uses.

The last approach worth mentioning here is the $$k$$-shell decomposition of the network, which differs from those mentioned in that it identifies sets of influencer nodes. It is been observed that under many circumstances, the most influential nodes may not necessarily be identified as those most connected or central to the network. This approach groups nodes into sets of $$k$$-shells, where each $$k$$-shell is a subgraph containing nodes with degree of at least $$k$$ (Pei et al., 2013; Probst et al., 2013; Basaras et al., 2013). In these cases, the best spreaders can be effectively identified as those situated in the core of the network, as given by $$k$$-shell decomposition (Kitsak et al., 2010). However, this approach has a number of drawbacks, including a high computational cost. The method also often fails to establish a monotonic relationship between $$k_{s}$$ and the total infected area, which can be given by degree based methods. Basaras et al. (2013) address this by creating a method that combines betweenness centrality and $$k$$-shell decomposition, and show that this hybrid method is more effective at identifying influential nodes and has less computational overhead than either of the component methods.

### Community Detection

\label{CDM}

Another key area for investigation in social network analysis is the study of how communities are structured. The term community is commonly used to refer to natural divisions of a network into densely connected subgroups, which are heterogeneous globally but homogeneous locally, according to some shared properties or roles in the graph. Fortunato (2010) provides a concise goal of this area when he comments that “the aim of community detection is to identify the modules and, possibly, their hierarchical organization, by only using information encoded in the graph topology” \citeyearpar{[}p. 78]Fortunato2010. Although community structure has been the topic of much research within the social sciences, the computerization of network analysis has allowed for the development of algorithmic approaches to community detection.

Plot of the modularity and dendrogram for a 64 vertex randomly generated community structured graph. The dotted red line indicates the point of maximum modularity, at which point the communities are extract; in this case, there are four, which are represented. Taken from Girvan et al. (2002).

Amongst the first seminal works in this area was an algorithm for community classification by Girvan et al. (2002), often referred to as the GN algorithm. The algorithm is divisive, so starts with one wholly connected community and iteratively divides the community by removing vertices with the lowest edge betweenness measures, which are recalculated at each step. Edge betweenness is similar to the betweenness centrality measures used for influencer identification, and represents how central an edge is. It is defined for edge $$i$$ by the number of shortest paths between vertices that run along $$i$$. The concept is that edges connecting communities will have high edge betweenness, so removing them will reveal the community structure. The authors also propose a measure for how good the community split is, the ‘modularity’ of the community partitions. This measure is related to the fraction of edges that fall within the same community. One can easily select the best community partitions by taking the split at the iteration that maximizes the modularity. Figure \ref{modularity} provides an example of the dendrogram of hierarchical community partitions, where the dotted red line represents the point of maximum modularity. The method produced reliable results when applied to real world network data, including social networks and the World Wide Web, as well as synthetically generated data where the community structure was known. However, the major detriment of the GN algorithm is that it has a high computational cost, which limits its application to large networks. The algorithm is also only applicable to undirected networks with unweighted edges.

To address the limitations in terms of computational cost of the GN algorithm, a number of methods have been proposed (Fortunato, 2010). One such algorithm is based on a hierarchical agglomeration (Clauset et al., 2004). The concept behind this approach is to gather vertices into groups such that there is a higher density of edges within groups than between groups. Optimisation of the modularity concept as defined by Girvan et al. (2002) will then identify the best community split. This algorithm runs in approximately linear time, relative to the size of the network, so allows for large scale community detection to be run efficiently. Indeed, the authors apply the method to a large network of product co-purchasing associations from the online retailer Amazon, and find good results. Despite this success, it is unclear how the results compare to other approaches as the algorithm is not tested and benchmarked on synthetically generated data where the community structure is known. As a side note, the authors comment that when a network is partitioned to the point of maximum modularity, the distribution of community sizes follows a power law. The reason behind this is an area for further research, but it is posited that the result could be a consequence of the sociology of the network, i.e. interest in various topics follows a power law. We will return to this point in Section \ref{hierarchy} where the research regarding hierarchical community structures is reviewed.

Following from these earlier community detection algorithms, a variety of methods have been developed, each with their advantages and disadvantages, however many give noticeably similar results. Newman (2013) addresses this by showing that under transformations using the eigenvectors of the adjacency matrix, termed spectral method formulations, with careful choice of free parameters, the communities classified by three different algorithms are identical. The theoretical results are verified over both real world data and computer generated data, however, the methods proposed are limited to discerning a dichotomous community structure only. Nevertheless, the analysis is an important step towards consolidating the myriad methods that have been developed for community detection by proposing equivalence between approaches.

All of the papers reviewed above are limited in that they only consider static networks, i.e. networks where the topology is fixed in time. Clearly, for these approaches to be of practical use in social networks the dynamics of communities must be considered. Pallo et al. (2007) outline the magnitude of this issue by stating that “our knowledge of the mechanisms governing the underlying community dynamics is limited, but is essential for a deeper understanding of the development and self-optimization of society as a whole” (Pallo et al., 2007). They open this research area by proposing a method that characterises community evolution by analysing the time dependence of overlapping communities identified using a ‘clique percolation’ algorithm. The authors apply this method to two real world data sets, one related to mobile phone calls, and the other to the collaboration network of scientists. They find that larger communities persist longer if membership is dynamic, while smaller communities, or cliques, are the opposite; their persistence is dependent on static membership. This study is an important one as it addresses a rather obvious question related to the utility of social network analysis, viz. how communities evolve over time.

## Community Structure

\label{hierarchy}

It was mentioned in the previous section that an analysis of the hierarchical structure of communities, as evident in the dendrograms produced by many community detection algorithms, may reveal interesting and as yet unexplained scaling laws. Guimera et al. (2003) explored this question by analyzing the hierarchical community structure produced by the GN algorithm applied to an email communication network from a university with around $$1,700$$ employees, as well as two computer-generated networks for validity. Their main approach involved a quantitative characterisation of the binary tree using a method originally introduced for the study of bifurcating river networks, the Horton-Stahler (HS) index. This showed that only the email communication network had a near constant bifurcation ratio, which implies that a self-similar community structure was only evident in the real world network. Further to this, given that self-similarity is evident in a similar way in both community structure and river networks, there may be a common principle of optimisation, either of water flow in rivers or information within communities acting as the underlying driving force for the emergence of the observed structure. This work is deeply interesting as it proposes that an underlying principle of self-organisation across hierarchical levels in social networks may evolve to optimise diffusion throughout the network.

Following the open research question left by Guimera et al. (2003) in terms of additional applications and verification of the presence and function of self-similarity in social networks, Hamilton et al. (2007) published an analysis of the complex structure of hunter-gatherer networks. They used a rare compilation of ethnographic data containing $$1,189$$ estimates of group size at multiple levels of organisation from a sample of 339 diverse hunter-gatherer societies. Similarly to Guimera et al. (2003), they utilised the HS index to construct bifurcation ratios between levels of aggregation. The results of this analysis show that the frequency of groups at each level of aggregation are statistically self-similar across the whole dataset. Furthermore, there was variation in the branching ratios by continent, suggesting that heterogeneous factors like geographical location have significant influence on the branching ratios. The main drawback from this publication is that there is potential bias in the data collection process that could have influenced the results. Nevertheless, the article has contributed an additional real-world application with interesting results to the existing body of research in this area.

Although there has been a lot of research produced around the analysis of hierarchical community structure within social networks, much of the work has proceeded along independent lines. For example, Song et al. (2005) produced a research article that was very similar to the investigation of Guimera et al. (2003), however it is clear from their short list of references to mostly seminal works, they did not incorporate much of the research already published. Nevertheless, their results demonstrate that four diverse, real world complex networks, including the social network of actor collaborations and the World Wide Web, consist of self-similar patterns on all scales. However, their methodology falls short in that they introduced a new method and did not validate the accuracy over a range of synthetically generated networks with known community structures and reference a benchmark model.

## Social Network Models

\label{lit_SNM}

Due to their complex and inter-connected nature, social networks are difficult to study with traditional analysis. Many mathematical methods with analytical solutions fail to account for complex, heterogeneous network structure. To address this, multi-agent models are increasingly being applied to produce realistic social networks by simulating the actions of a large number of connected agents following specific sets of rules. The aim of social network models is to generate a network which represents a real world social network topologically. The first key examples of this approach were the models of Watts et al. (1998) and Barabasi et al. (1999), however a great deal of work has been published since. This work has primarily been carried out within the physics community, and many of the models are simple enough for analytical results to be derived regarding the key topological properties of the resulting network. Numerical simulations are also commonly utilised to confirm the analytical results, or derive results which cannot be determined analytically. However, the models published to date utilise use rule sets not empirically validated on social networks, and the full range of network topologies, including community structure, have not always been evaluated.

There are two main types of social network models. Growth models aim to generate a social network by adding nodes connected with some probability at each time step, and rewiring models generally start with a fixed number of nodes and edges, and proceed to rewire edges to generate the target structure. For both these model types, there are a common set of agent rules and mechanisms deployed, as well as target network properties that are taken to represent a social network.

### Growth Models

The scale-free network model of Barabasi et al. (1999) has formed the basis of many growth models since. The model algorithm is defined as follows:

1. 1.

Initial Condition: Start with a network of $$m_{0}$$ nodes and $$e_{0}$$ edges

2. 2.

Growth: One node $$i$$ with $$m$$ edges is added at each time step

3. 3.

Preferential attachment: Each edge of $$i$$ is then connected to an existing node with probability proportional to its degree, i.e. the probability for the new node $$i$$ to be connected to an existing node $$j$$ is given by:

$$\Pi(k_{j})=\frac{k_{j}}{\sum_{i\in V}k_{i}},\\$$

where $$k_{i}$$ and $$k_{j}$$ represent the degree of nodes $$i$$ and $$j$$, respectively. $$V$$ represents the set of nodes, or vertices, of the network.

The linear preferential attachment mechanism as defined above by Barabasi et al. (1999) leads to a scale-free network with a power-law degree distribution, given by $$P(k)\sim k^{-\gamma}$$. The exponent $$\gamma$$ depends on the parameter $$m$$, and also multiple nodes are sometimes added to the network at each time step.

Due to the the linear preferential attachment mechanism being known to produce scale-free networks, a volume of research has been published with variations of it. These variations are usually introduced to produce additional topological features, such as high clustering and community structure. Kim (2002) extend the above model by including a triad formation step, aimed at producing higher clustering coefficients. After the preferential step is run, if an edge was created between nodes $$i$$ and $$j$$, a new edge is added between $$i$$ and a randomly selected neighbour of $$j$$, thus forming a connected triad. If there are no unconnected neighbours of $$j$$, the algorithm moves on to the next time step. The rationale behind this mechanism is a concept known as ‘sibling bias’, where an individual is more likely to know a friend of a friend. In this case, node $$i$$ is more likely to be connected to a friend of its new friend, node $$j$$. Guo et al. (2006) apply a similar concept, but replace the random attachment to a neighbour of $$j$$ with a preferential attachment mechanism based on the degree of the neighbour. In this modification, the neighbour with highest degree is most likely to be selected.

Another variation is to define a ‘local-world’, or ‘neighbourhood’ within the network which limits the connection range of new nodes. For instance, Li et al. (2003) define a local-world evolving network model, where $$M$$ nodes are selected randomly as the local-world for a newly added node. The effect of this modification is to truncate the power-law degree distribution to an exponential for large degrees, which is commonly observed. The linear preferential attachment mechanism is modified to only connect to the local-world, becoming:

$$\Pi_{local}(k_{j})=\Pi^{\prime}\left(j\in local\right)\frac{k_{j}}{\sum_{i,local}k_{i}},\\$$

where $$\Pi^{\prime}\left(j\in local\right)=M/(m_{0}+t)$$. Cao et al. (2006) adopt a similar approach, but with a more formal definition of a neighbourhood $$N_{p}(i)$$ with depth $$p$$,

$$N_{p}(i)=\{j\mid d_{ij}\leq p;j\in V,j\neq i\},\\$$

where $$d_{ij}$$ is the geodesic distance between nodes $$i$$ and $$j$$. The preferential attachment mechanism then becomes

$$\Pi_{N_{p}}(k_{j})=\Pi\left(j\in N_{p}\right)\frac{k_{j}}{\sum_{i,local}k_{i}},\\$$

where $$\Pi(j\in N_{p})=size(N_{p})/(m_{0}+t)$$. A further variation of this approach is from Guan et al. (2008), who introduce the concept of physical position as a node attribute. Each node is assigned a physical state vector, $$x_{i}\in\mathbb{R}$$, and the distance between two nodes is defined as $$d_{ij}=\left|{x_{i}-x_{j}}\right|$$. The physical neighbourhood of node $$i$$ with depth $$M$$ is then given by

$$N_{M}(i)=\{j\mid d_{ij}\leq d_{ik}\ ;\ \forall\ k\notin N_{M}(i),\ k\neq i,\ j\neq i,\ i,j,k\in V\}\\$$

All three of these models provide a tunable transition between power-law and exponential degree distributions. Li et al. (2003) show the degree distribution is modified by tuning the size of the neighbourhood between the two extremes of $$m\leq size(N_{p})\leq m_{0}+t$$. When $$size(N_{p})=2m+1$$ at the lower end, $$P(k)\sim\exp{-k/m}$$. When $$size(N_{p})=t+m_{0}$$, i.e. the neighbourhood is the whole network, we get a power-law distribution, $$P(k)\sim 2m^{2}/k^{3}$$. The results are similar for all three papers.

Several alternative connection mechanisms have also been proposed which give rise to scale-free networks. Saramaki et al. (2004) show that a network growth model where newly added nodes connect to the node which lies at the end of a random walk of length $$l$$. However, it is noted that this process resembles linear preferential attachment, as nodes with high degree are more likely to in the random path. Tanaka et al. (2008) use a weight driven preferential attachment mechanism, where edge weights are established for new nodes, or incremented for existing nodes. Power-law distributions are observed for degree, strength, and weight distributions, with exponents which can be controlled by modifying the model parameters.

Toivonen et al. (2006) demonstrate that a scale-free network can be constructed by a two step process of first randomly selecting a node, then randomly selected a neighbour of the first node. Again, a random neighbour of a random node is still more likely to be selected if it has higher degree, as it will have more neighbours. Nevertheless, Toivonen et al. (2006) show that their mechanism produces several of the target topological properties for social networks, including high clustering, power-law distributed degree distribution, and community structure. The latter property is evaluated by using a $$k$$-clique community detection method, which allows for overlapping communities. The authors also note that the community size distribution also follows a power-law. More recently, Yang et al. (2013) published a similar method which involves a mechanism connecting to random neightbours of arbitrarily chosen nodes. A scale-free network results from this model as well.

### Rewiring Models

Social network models involving rewiring of edges have received less attention than growth models, and research has appeared later. With rewiring models, additional consideration must be given to the existing connection structure of the network. The exact rewiring schemes, however, tend largely to follow preferential attachment, similarly to growth models, although alternative models have also been proposed. While some research has been published on rewiring models alone, the outlook for social network models appears to be towards models incorporating both growth and rewiring.

Lindquist et al. (2009) proposed a network evolution model through different rewiring schemes. Starting with an undirected and unweighted network with a fixed number of nodes and edges, two rewiring schemes with two mechanisms are proposed, yielding four separate cases. Firstly, the rewired edge can be detached from either end nodes, either the selected node or the neighbouring node. The authors utilise both random connection preferences, and linear preferential attachment. Contrary to growth models utilising preferential attachment, they find that power-law degree distributions are only achieved for specific average degree values in the case with linear preferential attachment, and the neighbouring node is detached.

Johnson et al. (2009) published a rewiring mechanism of preferential detachment and reattachment, however, nonlinear preferential functions are used. In this study on a network with a fixed number of nodes and edges, a node $$i$$ is first chosen with probability $$P(k_{i})\sim k_{i}^{\beta}$$. An edge of the selected node $$i$$ is randomly chosen and reconnected to node $$j$$ with probability $$\Pi(k_{j})\sim k_{j}^{\alpha}$$. The authors find that scale-free distributions emerge for a range of exponents only when $$\alpha=\beta$$, i.e. the probability of gaining or losing edges is the same.

A more recent study from Guang et al. (2012) involves a weighted preferential rewiring mechanism. Starting from a connected network with a fixed number of nodes and edges, and constant edge weights, the weight strength for node $$i$$ is defined as the sum of weights for all its edges, $$s_{i}=\sum_{j}w_{ij}$$. In this rewiring model, at each time step a random edge is selected and one connected node is fixed. The edge is then reconnected to another node by node strength preferential attachment,

$$\Pi_{i}=\frac{s_{i}}{\sum_{j}s_{j}}\\$$

The results of this study show that node strength is scale-free with exponent $$\gamma=-1$$, the edge weight distribution is exponential, and degree distribution is scale-free again with $$\gamma=-1$$.

### Ensemble Network Models

The three studies of rewiring models previously reviewed focus strongly on generating power-law distributed social networks, and largely ignore many of the other topological properties. Furthermore, these models fail to account for network growth. The network growth models in turn ignore changes in the network once new nodes have been added.

By combining growth and rewiring into one model, a more accurate representation of a social network may be achieved. Fan et al. (2009) published a model which incorporates five separate mechanisms that modify the node and edge structure of the network. Of all the network models reviewed, it is worth enumerating the full rule set for this model as the authors utilise a variety of mechanisms which modify the topology of the network. These mechanisms have been classified in italics. Starting with $$m$$ isolated local-worlds $$\Omega_{i}$$, $$i=1\ldots m$$, each with $$m_{0}$$ isolated nodes, at each time step one of the following steps is performed:

1. 1.

Community Growth: With probability $$p$$, a new local-world is created containing $$m_{o}$$ nodes and $$e_{0}$$ links.

2. 2.

Network Growth: With probability $$q$$, a new node is added to an existing local-world $$\Omega_{i}$$ with $$m_{1}$$ new links according to adjusted linear preferential attachment

$$\label{FanPi} \label{FanPi}\Pi(k_{i})=\frac{k_{i}+\alpha}{\sum_{j\in\Omega_{i}}(k_{j}+\alpha)},\\$$

where the parameter $$\alpha>0$$ is a constant which increases the likelihood of the new node $$i$$ to attract new links.

3. 3.

Community Structure: With probability $$r$$, $$m_{2}$$ new links are added to an existing local-world by randomly selecting a local-world and a node in the local-world, then connecting a new link according to equation (\ref{FanPi}).

4. 4.

Network Detraction: With probability $$s$$, $$m_{3}$$ links are deleted. The links are selected by randomly selecting an $$\Omega_{i}$$, then randomly choosing a node and deleting the link connecting to another node selected with probability

$$\Pi^{\prime}(k_{i})=\frac{1}{N_{\Omega_{i}}(t)-1}\left(1-\Pi(k_{i})\right),\\$$

where $$N_{\Omega_{i}}(t)$$ represents the number of nodes within $$\Omega_{i}$$ at time step $$t$$.

5. 5.

Community Bridges: With probability $$u$$, $$m_{4}$$ links are added between different local worlds by randomly choosing an $$\Omega_{i}$$, then randomly selecting a node in $$\Omega_{i}$$ with (\ref{FanPi}). We then choose a different $$\Omega_{j}$$ and connect the first node to a node in $$\Omega_{j}$$ chosen again with (\ref{FanPi}).

It is noted that $$0\leq p,q,r,s,u\leq 1$$, and $$p+q+r+s+u=1$$. The resulting complex network generated from this model has a degree distribution that is a mix of random, exponential and power-law, which the authors analytically derive this result using mean-field theory. The model is compared to real world data for the internet, and is shown to provide a closer degree distribution than the comparison models. However, no other topological features are evaluated, despite the formation of local-world communities being a primary motivation for this study. In addition, numerical simulations could provide some insight into additional measures, but are not conducted.

Judging from the research reviewed here, there is clearly a gap in the literature regarding multi-agent models of social networks. To address this gap, a consensus is required regarding the topological features that a social network model needs to achieve. A testing framework to evaluate how well a model succeeds in generating these features is also required. Given that OSNs provide empirical data sets which can be used to evaluate a model against, future studies can utilise this data to instead aid in defining the agent rules. Developing a prediction model for link formation is one approach which can accomplish this aim.

### Network Diffusion

\label{lit_diffusion}

The concept of diffusion on a social network is important due to its many practical applications. For instance, one could apply models of diffusion or propagation processes to social networks in order to better understand the dynamics of the spread of infectious diseases, the adoption of innovations and new products, or the spread of information. In recent years, the breadth of data now available regarding product transactions and social networks have opened opportunities to better understand diffusion, particularly from a marketing perspective (Luu et al., 2012). Despite this collection of data, it can still be very difficult to acquire information regarding the full topology of a social network due to privacy limitations. The research around diffusion modelling can be divided into two separate branches.

The first branch of research is referred to in this review as dynamical systems models, which relate to the modelling of interactions between groups of interest, typically by means of systems of differential equations. Examples of this include models of the spread of epidemics and the Bass model for the diffusion of innovations (Newman, 2003). These models do not require explicit knowledge of the network topology, but rather model the interactions between groups according to known diffusion parameters and statistical properties on a macro scale. Commonly, this approach will utilise nonlinear differential equations to capture the core dynamics. The other category of models are referred to as agent based diffusion models. This type of modelling investigates the behaviour of individual nodes, or agents, and requires much richer data regarding the network topology.

#### Dynamical Systems Models

Motivated by the work of Barabasi et al. (1999) on defining the class of scale-free networks, Pastor-Satorras et al. (2001) observed that this network model not only favours efficient communication, but also the spread of viruses. Traditionally, the spread of viruses have been modelled by the standard susceptible-infected-susceptible (SIS) epidemiolgical model. In this model, each node exists in one of two states: ‘susceptible’ or ‘infected’. Each suceptible node is infected with probability $$\nu$$ if it is connected to at least one infected node. Once infected, the node will recover with probability $$\delta$$ and move back to the susceptible state. From these parameters, a spreading rate is defined by $$\lambda=\nu/\delta$$. Commonly in epidemiological modelling, a critical threshold $$\lambda_{c}$$ can be identified such that if $$\lambda\geq\lambda_{c}$$, the virus will spread critically and persist throughout the network. Below the threshold, where $$\lambda<\lambda_{c}$$, the epidemic dies out exponentially fast. Through analysing data on real computer virus epidemics across the internet, which has been found to be scale-free (Newman, 2003), the authors find the absence of an epidemic threshold for critical spreading. This result indicates that the diffusion dynamics of computer viruses on scale-free networks are independent of spreading rate $$\lambda$$. However, the level of persistent infection throughout the network was found to be low.

One potential limitation of the previous work is that a power law degree distribtuion has been assumed across the network. Noting that this is not always the case in real world networks, Luu et al. (2012) define a multi-stage diffusion model that accounts for dynamic degree distributions. In particular, the authors develop separate diffusion models for networks with both power law and exponential degree distributions. To reiterate, a power law degree distribution is where the probability of a node having degree $$k$$ is distributed as $$P(k)\sim k^{-\gamma}$$, where $$\gamma$$ is a constant. In contrast, an exponential degree destribution is given by $$P(k)\sim e^{\frac{-k}{\lambda}}$$, where $$\lambda$$ is a constant. The work of Luu et al. (2012) also differs from Pastor-Satorras et al. (2001) since rather than a SIS diffusion model, the authors apply the Bass diffusion model of innovation in which the number of new adopters is modelled as a function of the total number of adopters and an adoption rate. Luu et al. (2012) develop their multi-stage diffusion model as a combination of Bass diffusion models on both exponential and power law degree distributions. This model is tested on both synthetically generated network data and a real world dataset, and is found to be more accurate. However, an efficient way to estimate the degree distribution of non-adopters is still required.

Further to the previous paper, Kim et al. (2013) produced a study aimed at modelling diffusion dynamics across multiple online media types. The research question asked relates to the significance of the interaction of different social media types on the diffusion of news stories. Similarly to Luu et al. (2012), the authors adopt a probabilistic dynamical systems model due to the absence of reliable data regarding the network topology. Using a data set of $$386$$ million web documents taken from one month in early $$2011$$, they take a much smaller sample of $$4.1$$ million documents connected via hyperlinks. They then extract key entities representing contextual news topics from these documents, and analyse the diffusion of the top $$63$$ news topics across social media types using a proposed dynamic influence model. The authors found that indeed, contextual information spreads at different rates across different social platforms. However, there are a number of potential problems with this methodology. Firstly, the spread of information is defined by the explicit sharing or replication of a hyperlink, which only accounts for one type of digital information transfer. Secondly, the authors chose the top $$63$$ news stories to be those that received the most attention, which is a biased sample and offers no information regarding what factors negatively affect diffusion.

Another concept that has gained popularity in social networks relates to the propagation and replication of ‘memes’. The concept of a ‘meme’ has its origins with Dawkins (1976), who defined them as units of mental content or cultural information that reproduce and evolve in a fashion similar to biological genes. With the introduction of online news, blogs and social networking sites, the empirical study of memes, or memetics, has received much attention in the social networks literature. For example, Leskovec et al. (2009) develop a framework for tracking the propagation and evolution of short phrases across online news and blog sites. The authors develop scalable algorithms to cluster textual variants of key phrases in quoted text across multipe articles, showing that this meme-tracking approach provides a coherent representation of the ‘news cycle’. With the set of memes automatically identified, threads of meme propogation are defined and a dynamical model is constructed to reproduce much of the variation. As a set of minimal requirements, it is found that a process of imitation, where threads that gain significant volume are more likely to persist through growth from other adopters, and favourable weighting for recency are sufficient to reproduce the observed dynamics.

A further application of interaction models to social networks can be found in the literature pertaining to the popularity of online content. Popularity involves competition between alternative items of content, and as such some interesting dynamics can result. For example, Ratkiewicz et al. (2010) provide a large scale analysis of the temporal dynamics of online content from two key data sources. The first is the full set of Wikipedia articles and their associated web traffic from $$2001$$ to $$2007$$, and the second is a yearly sequence of crawls on the Chilean Web. It is found that the popularity dynamics are characterised by bursts, and the popularity of articles or web pages follow a power law distribution. These characteristics are not readily reproduced by standard network models, however by introducing a re-ranking probability $$\rho$$ that randomly moves any given node to the highest rank, and combining with preferential attachment, the activity bursts and fat tailed popularity distribution are reproduced. Another approach is given by Gleeson et al. (2014), who find that empirically observed power law meme popularity distributions can be reproduced by a dynamical model where memes compete for the limited resource of attention. The authors dub this mechanism ‘Competition Induced Criticality’ since the resulting dynamics resemble those of ‘Self-Organised Criticality’ models applied to critical phenomena such as avalanches. This concept is intriguing as it implies that the critical spreading behaviour of online content may be akin to other critical phenomena, which exhibit interesting properties including scaling laws and fractality (Schroeder, 1991).

#### Multi-Agent Diffusion Models

In the previous section, we discussed approaches to modelling diffusion and spreading behaviours on networks where detailed information regarding the topology is not availble, so instead are approximated statistically. However, given richer data sets regarding individual nodes and edges in the network, we can apply multi-agent diffusion models. These models distinguish how individual behaviour relates to network structure on a granular level, and in many cases interesting global properties can emerge from the local dynamics.

A good introduction to the topic of multi-agent diffusion models is provided by Rahmandad et al. (2008). Due largely to the increase in computational power, many researchers are turning to agent based models to simulate diffusion processes. However, there is little concensus as to whether such models will provide better results than dynamical systems models as each have their strengths and weaknesses. For example, nonlinear differential equations can effectively incorporate a wide range of feedback mechanisms, but are limited to a small number of states that can be modelled. Furthermore, within each state nodes are assumed to be homogeneous and well mixed according to a standard distribution. Alternatively, multi-agent models can also include many types of feedback, but can be applied across heterogeneity in a number of network properties and individual attributes. The cost of this advantage is greatly increased computational complexity and a detailed knowledge of network structure and individual propensities. Rahmandad et al. (2008) find that out of the differences between model classes, potentially the greatest is that many interaction models provide a single forecast trajectory, while multi-agent models can generate a distribution of outcomes. The authors test the impact of individual heterogeneity and different network topologies on both classes of models in an epidemic context, and find that network structure significantly affects the diffusion dynamics.

#### Meme Prediction

Moving on to specific applications of agent models on real world data, Prakash et al. (2012) analyse threshold conditions for a virus to spread critically across a static network. However, in this case the full network topology is analysed to prove a general theorem which decouples the effects of the network topology and the virus propagation model. Prakash et al. (2012) also show that the threshold spreading rate depends only on the first eigenvalue of the connectivity matrix for any graph, and also for all standard propagation models. This theorem is proved both theoretically and also on real world data sets where the full adjacency matrix is calculated. This result has practical significance in controlling the outbreak of epidemics, as targeting a small number of influential nodes can dramatically alter the virus spreading widely. In this application, methods to identify influencer nodes such as eigenvector centrality, as discussed in Section \ref{influencers}, align well with the theorem. This result is significant, however it has only been proven on single virus propagation models over static networks. The theorem may not necessarily generalise to heterogeneity in either the diffusion process or the network structure.

Further to the comments of Rahmandad et al. (2008) regarding modelling large numbers of states, Weng et al. (2012) ask the question of why some memes go viral while others do not, and employ a multi-agent approach to reproduce observed diffusion dynamics. Using data from Twitter, Weng et al. (2012) identify a number of empirical regularities, including long-tailed distributions for meme lifetime, user activity and meme popularity. This points to very hetergeneous user sharing behaviours. In their diffusion model, a frozen network of agents is assumed where each agent maintains a time ordered list of posts about individual memes. With equal probability, the agents can generate a new post or share an existing post in their list, transmitting either to their neighbours. Limited attention is incorporated by restricting posts to sit in an agent’s list for finite time $$t_{w}$$, and competition amongst memes is achieved by tuning $$t_{w}$$ below $$1$$ for increased competition or greater than $$1$$ for more attention and less competition. After simulations of this model, Weng et al. (2012) find that in order to reproduce many characteristics observed empirically, including the long-tailed distributions of meme popularity and lifetime, the combination of social network structure and competition amongst memes are sufficient conditions.

Although the work of Rahmandad et al. (2008) included heterogeneity at the individual node level and different network topologies, the effect of community structure was not considered. In previous research, the viral propagation of memes is generally assumed to spread as an infectious disease, or simple contagion. In contrast, Weng et al. (2013) cite evidence that the spreading behaviour of memes is more akin to a complex contagion, where the effects of social reinforcement and homophily reinforce the spreading. In other words, diffusion within highly clustered communities which exhibit homogeneity across significant attributes, such as topical interests, is enhanced. This phenomena is referred to as social reinforcement, and when combined with structural trapping, i.e. where cross-community diffusion is dampened, the global diffusion dynamics can be dramatically altered. To test this hypothesis, Weng et al. (2013) develop a model to predict whether or not a meme spreads virally using data from two real world social networks and synthetic data sets. They find that the likelihood of a meme spreading virally is negatively correlated with meme concentration. The rationale behind this result is that if a meme spreads only to a highly concentrated set of nodes, it has likely saturated a community, but received little attention elsewhere in the network. In contrast, a meme with low concentration has likely spread further throughout the network and received attention from a wider range of communities.

## Research Methodology

The proposed research aims to model the emergence of viral content and memes in social networks. There are many methods, techniques and frameworks which are required to achieve this research aim. Firstly, the adopted modelling approaches are categorised within the fields of statistical machine learning and multi-agent systems. For the knowledge contributions around link prediction and the modelling of OSN user content engagement propensities, we are adopting methods from statistical supervised learning. Unsupervised learning methods will also be used to better understand OSN users based on their modelled engagement propensities. Furthermore, to handle data from real world OSNs, we will need to apply natural language processing to extract meaning from user comments and free text.

Our approach to modelling and simulating the dynamics of diffusion within a social network utilise multi-agent systems. The rule sets for these systems are derived through machine learning techniques. The networks over which these models are applied will be a combination of artificially generated graphs, and real world online social networks. To understand and classify the emergence of viral content and memes, we will use a variety of analysis, compare measures such as the velocity and extend of the diffusion to real world examples.

## Statistical Machine Learning

Statistical machine learning forms a large part of the modern field of machine or computer learning. For the purposes of this research, we utilise three key areas of machine learning. These are supervised learning methods, used to create prediction models for specific outcomes or targets. Unsupervised learning will be used to identify natural groupings in data. The third method is natural language processing, which is crucial to working with data from online social networks as much of the data is unstructured free text.

### Supervised Learning

Supervised learning describes the machine learning task of inferring a function which maps a set of inputs to a target output, or dependent variable, for a set of training data. The dependent variable or vector can be a binary flag, a categorical variable, or a numeric vector. There are a wide range of methods available for supervised learning, and the applicability of each depends on the type of dependent variable. A varying set of evaluation methods are also utilised, including measure the predictive accuracy, and applying the model over varying partitions and samples of data. This evaluation is used to determine broadly how accurate the model is, how robust it is to variations in data, and the importance of each of the features.

In this research, we utilise several supervised learning techniques. The first is the random forest model (Breiman, 2001), which we use for link prediction with graph topological measures. The dependent variable for this model is a binary flag for whether a link between two nodes exists or not. This technique effectively creates a large number of decision trees over different samples of the data, then combines all the individual results into one model. This is advantageous for the link prediction problem as more compex methods for data sampling are not required. The model is also very accurate and can be trained to effectively distinguish between unbalanced classes.

For our extension of the link prediction problem to include link sign and nature, as well as non-topological features, a range of supervised learning techniques will be applied and tested. The key requirements for these models are that they have high accuracy over unbalanced classes, provide feature importance scores, and have a reasonably low computational cost. This last requirement is generally satisfied for many regression techniques, such as the LASSO. Models such as the random forest and support vector machine can have high computational cost, but can also deliver higher accuracy then regression methods.

Finally, a key research component to this thesis is to create a model to describe and predict different types of OSN users levels of engagement with heterogeneous content. For this application, we propose using a Bayesian belief network. This type of model is suited to this task as we are likely to have a high dimensional data set. Bayesian techniques can handle this sort of data well. However, other models will also be applied to ensure that the most suited technique is used.

### Unsupervised Learning

Unsupervised learning is a type of machine learning task that involves drawing inferences from sets of data without a specified target variable. Unsupervised learning methods are commonly used in exploratory analysis to find hidden patterns or grouping in data. The process for applying an unsupervised learning technique is simpler than the supervised case as there is no consideration of a target response. However, performance evaluation is still important. This usually take the form of measures which quantity the amount of variation in the data captured by the groupings.

There are many different methods available for unsupervised learning. Perhaps the most common are cluster analysis techniques, such as k-means or hierarchical clustering. These methods are generally based on iterative algorithms which seek to reallocate observations to a set of clusters so that an evaluation function is minimised. In this research, we propose to use unsupervised learning to find natural groupings amongst OSN users with regard to their content preferences. These content clusters can be used to define preference communities within a social network, which form an input to the multi-agent modelling of content diffusion.

### Natural Language Processing

Natural language processing (NLP) is concerned with the interaction between human language and computers. NLP consists of a set of techniques to translate natural language into data structures that computers can interpret. There are several key categories of NLP, which range over a variety of use cases. For this research, we will be processing unstructured free text data from online social network platforms. The key types of NLP we will apply belong to the categories of sentiment analysis, named entity recognition, topic recognition and natural language understanding.

Sentiment analysis refers to techniques that aim to understand the sentiment of a piece of text with respect to a topic or named entity. Named entity recognition determines which items in a text to map to names, and what those names relate to. Topic recognition similariy attempts to identify topics that are referenced, and segment the text into pieces that deal with each topic. Natural language understanding is perhaps the most complex of these areas, and attempts to identify the intended semantic, or meaning from the text, from the multiple possible semantics.

### Graph Feature Extraction

The proposed research aims to use features derived from the topology of a network in many different ways. Topological feature sets are used for the link prediction problem, as well as in designing multi-agent systems and the analysis of viral diffusion and meme propagation. Novel approaches to the extraction of feature sets in graphs are developed in this thesis.

The literature and body of theory related to graphs defines numerous measures, algorithms and techniques to extract information regarding the network topology. In this thesis, we utilise existing research to formulate a wide range of graph features to be used for modelling. The features used relate to network centrality and measures of influence, community structure and classification, and spectral analysis and projections. We extend the state-of-the-art by further evaluating the influence of each feature set on the dynamics of diffusion.

## Multi-Agent Systems

Multi-agent systems are computerised models of the interactions between multiple agents and their environment. These models have particularly high utility in social networks, since a social network is a network of interacting agents. Generally, the process of developing a multi-agent system starts with a definition or derivation of the agent rules, which govern the interactions between agents. The system is then run in simulation and the results are analysed.

Multi-agent systems are particularly useful to model the evolution of social networks, and diffusion within the network. Through simulating the interactions between agents, we can explore global emergent properties in a network.

### Social Network Models

Multi-agent models have been used extensively in social network analysis due to their ability to incorporate complex graph structure and agent behaviour. Their use goes back to the small-world and scale-free network models of Watts et al. (1998) and Barabasi et al. (1999), respectively. Multi-agent models have since been applied extensively to create artificial networks with real world properties.

A key challenge with using multi-agent models to simulate a social network’s evolution is in defining the rule set. This rule set can define growing mechanisms for how new nodes connect to the existing network, and rewiring or removal of existing connections. Following from the small-world and scale-free models, most research in this area has focused on tuning the scale-free model (Kim, 2002; Guo et al., 2006), or producing scale-free networks with other methods such as random walks and alternative growth mechanisms (Saramaki et al., 2004; Yang et al., 2013).

We propose extending the existing literature by firstly providing a review of social network multi-agent models, which is currently lacking. Secondly, we develop a multi-agent model for specific online social networks using link prediction models to derive the agent rule set empirically.

### Diffusion Models

Much of the early research around diffusion in social networks utilised dynamical systems models. However, most of these models do not take into account the exact topology of the network. Network effects clearly can have a dramatic influence on diffusion (Rahmandad et al., 2008). To address this, multi-agent models have been proposed where agents interact across the network topology.

Again, the key challenge with multi-agent diffusion models is in defining the rule set. Generally, this rule set is more complex than an evolving social network model. Previous research has shown that agent rules can incorporate limited attention (Weng et al., 2012) and the influence of neighbours in a complex contagion (Barash et al., 2012). We aim to extend this research by creating a multi-agent diffusion model which incorporates more topological features, and agent rules for content preference. These rule sets will be derived from real world data using both link prediction and content engagement likelihood models.

### Data Applications

The data applications for this research represents a novel contribution to the existing literature. The methodology developed will be applied to artifically generated networks, and also real world data from Twitter. R statistical software is used for all statistical work, and we make extensive use of the iGraph package (Csardi et al., 2006). Multi-agent systems will be programmed in combination of R and Python. Methods to identify the spread of memes and viral content in Twitter and Facebook are also developed.

#### Artificial Networks

There are three key types of artificial networks used in this thesis; the Erdős and Rényi random graph (Erdős et al., 1957), the small-world model of Watts et al. (1998) and the scale-free model of Barabasi et al. (1999). All are available within the iGraph package in R.

#### Real World Networks

Data regarding users and activity from the social media platforms Twitter and Facebook will be extracted over the period 1 November 2014 to 31 May 2015. Details regarding volume and available data will be provided once data has been extracted. This data is provided by social media analytics company Digivizer.

## Ethics and Risks

\label{cha:ethics}

## Research Plan

\label{cha:ResearchPlan}

## Timeline

Year 2014 Activities Output [leftmargin=+0.5cm] • Technology Research Preparation • Technology Research Methods • Literature review • Development of research methodology [leftmargin=+0.5cm] • Successful completion of TRP • Successful completion of TRM • First draft literature review completion • Investigation into research methodology [leftmargin=+0.5cm] • Develop topological link prediction method • Develop extended link prediction method • Develop network property classification method using link prediction • Review of social network models literature • Source real world data [leftmargin=+0.5cm] • Draft chapter on link prediction methods • Second draft literature review completion • Real world data sets sourced • Candidature assessment completion [leftmargin=+0.5cm] • Develop evolving social network model • Develop multi-agent diffusion model for homogeneous content • Develop content engagement model for OSN users [leftmargin=+0.5cm] • Draft chapter on evolving social network model • Draft chapter on multi-agent diffusion model • Draft chapter on content engagement model [leftmargin=+0.5cm] • Develop methods to identify and analyse memes and viral content in real world data • Develop multi-agent diffusion model with heterogeneous content • Run multi-agent simulations for viral content and meme emergence • Develop methods to analyse and compare viral content and meme emergence to real world cases [leftmargin=+0.5cm] • Draft chapter on viral content and meme emergence • Finalisation of all chapters • Entire dissertation completed and submitted.

This research is proposed to be completed and delivered on the $$31^{st}$$ of December, $$2017$$, according to the timeline outlined in Table \ref{timeline}.

### Resource Requirements

Although this research relies heavily on data processing algorithms and technology, much of this already exists and is available at UTS, or will be created during the research.

#### Software

The largest portion of the work required for this research will be in writing and implementing software applied to large data sets. This is to be done primarily with the R software language and Python programming language. R and Python are both open source so there is no additional software cost for this project.

#### Hardware

It is anticipated that many of the real world OSN data sets and simulated networks used in this research will be very large. In addition, many of the techniques which will be applied to these data sets have a heavy computational cost. Access to the UTS High Performance Computing Linux Cluster is therefore required to carry out this research.

## Progress to Date

\label{cha:Progress}

In agreement with the research timeline presented in Table \ref{timeline}, the following tasks have been completed.

## Literature Review

The Technology Research Preparation course, subject number $$32144$$, was completed in Semester $$1$$, $$2014$$ with a mark of $$94$$ and grade of High Distinction. An output of this course is a draft literature review for the thesis. In preparation for the candidature assessment, this draft has been extended to include the areas of social network models and link prediction. The literature review will need to be constantly updated until this dissertation is finished, as the research area is very active and new papers are frequently being published. Nevertheless, the literature review to date has identified gaps in the research and has informed the proposed research questions. Furthermore, there is a gap regarding published reviews of the literature for social network models. We aim to address this gap with a review publication, largely based on the literature review which has been completed.

## Research Questions and Methodology

The Technology Research Methods course, subject number $$32931$$, was completed in Semester $$2$$, $$2014$$ with a mark of $$83$$ and grade of Distinction. The output of this course was an early draft for the candidature assessment document. This has been modified and extended to cover additional work completed since this course.

## Development of link prediction methods

A theoretical framework and R software implementation has been developed for the first knowledge contribution regarding link prediction using topological features. A paper entitled ‘Link Prediction and Topological Feature Importance in Social Networks’ was submitted for review to the 2015 Australasian Data Mining Conference, and has been accepted. In this paper, a novel approach to link prediction using topological graph features with a random forest learning model was developed. Random forest variable importance scoring was also used to evaluate properties of each network. The approach was applied to three real world data sets, the citation networks Cora, Citeseer and WebKB, as well as three artificially generated networks. This paper forms the basis for the first thesis chapter on link prediction methods.

Work is currently underway to publish an R package of this implementation, and extend the theoretical framework to include additional link characteristics.

### References

1. Erdős, Rényi. On random graphs. Publicationes Mathematicae 6, 290–297 (1957).

2. Jeffrey Travers, Stanley Milgram. An experimental study of the small world problem. Sociometry 32, 425–423 (1967).

3. Duncan J. Watts, Steven H. Strogatz. Collective dynamics of ’small-world’ networks. Nature 393, 440–442 (1998).

4. A. Barabasi, R. Albert. Emergence of scaling in random networks. Science 286, 509–512 (1999).

5. M. Girvan, M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99, 7821–7826 (2002).

6. Richard Dawkins. The Selfish Gene. Oxford University Press, 1976.

7. Lilian Weng, Filippo Menczer, Yong-Yeol Ahn. Predicting successful memes using network and community structure. In Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media. (2014).

8. M. E. J. Newman. The structure and function of complex networks. SIAM Review 45, 167–256 (2003).

9. L. C. Freeman. The sage handbook of social network analysis. 26–39 Sage Publications Ltd, 2011.

10. Sen Pei, Hernan A. Makse. Spreading dynamics in complex networks. Journal of Statistical Mechanics: Theory and Experiment 12, 1–21 (2013).

11. M. E. J. Newman. A measure of betweenness centrality based on random walks. Social Networks 27, 39–54 (2005).

12. Ulrik Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology 25, 163–177 (2001).

13. Ulrik Brandes. On variants of shortest-path betweenness centrality and their generic computation. Social Networks 30, 136–145 (2008).

14. Amin Mantrach, Luh Yen, Jerome Callut, Kevin Francoisse, Masashi Shimbo, Marco Saerens. The sum-over-paths covariance kernel: a novel covariance measure between nodes of a directed graph. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 1112–1126 (2010).

15. Linyuan Lu, Yi-Cheng Zhang, Chi Ho Yeung, Tao Zhou. Leaders in social networks, the delicious case. PLoS ONE 6, 1–9 (2011).

16. Jianshu Weng, Ee-Peng Lim, Jing Jiang, Qi He. TwitterRank: finding topic-sensitive influential Twitterers. ACM International Conference on Web Search and Data Mining 261–270 In ACM International Conference on Web Search and Data Mining. (2010).

17. Florian Probst, Laura Grosswiele, Regina Pfleger. Who will lead and who will follow: identifying influential users in online social networks. Business and Information Systems Engineering 3, 179–193 (2013).

18. Pavlos Basaras, Dimitrios Katsaros, Leandros Tassiulas. Detecting influential spreaders in complex, dynamic networks. Computer 46, 24–49 (2013).

19. Maksim Kitsak, Lazaros K. Gallos, Shlomo Havlin, Fredrik Liljeros, Lev Muchnik, H. Eugene Stanley, Hernan A. Makse. Identification of influential spreaders in complex networks. Nature Physics 6, 888–893 (2010).

20. S. Fortunato. Community detection in graphs. Physics Reports 486, 75–174 (2010).

21. A. Clauset, M. E. J. Newman, C. Moore. Finding community structure in very large networks. Physical Review E 70, 1–6 (2004).

22. M. E. J. Newman. Spectral methods for community detection and graph partitioning. Physical Review E 88, 1–10 (2013).

23. G. Pallo, A. L. Barabasi, T. Vicsek. Quantifying social group evolution. Nature 446, 664–667 (2007).

24. R. Guimera, L. Danon, A. Dıaz-Guilera, F. Giralt, A. Arenas. Self-similar community structure in a network of human interactions. Physical Review E 68, 1–4 (2003).

25. Marcus J. Hamilton, Bruce T. Milne, Robert S. Walker, Oskar Burger, James H. Brown. The complex structure of hunter-gatherer social networks. Proceedings B 274, 2195–2202 (2007).

26. Chaoming Song, Shlomo Havlin, Hernan A. Makse. Self-similarity of complex networks. Nature 433, 392–395 (2005).

27. Petter Holme AND Beom Jun Kim. Growing scale-free networks with tunable clustering. Physical Review E 65, 1–4 (2002).

28. Qiang Guo, Tao Zhou, Jian-Guo Liu, Wen-Jie Bai, Bing-Hong Wang, Ming Zhao. Growing scale-free small-world networks with tunable assortative coefficient. Physica A 371, 814–822 (2006).

29. Xiang Li, Guanrong Chen. A local-world evolving network model. Physica A 328, 274–286 (2003).

30. Y. J. Cao, G. Z. Wang, Q. Y. Jiang, Z. X. Han. A neighbourhood evolving network model. Physics Letters A 349, 462–466 (2006).

31. Zhi-Hong Guan, Zheng-Ping Wu. The physical position neighbourhood evolving network model. Physica A 387, 314–322 (2008).

32. Jari Saramaki, Kimmo Kaski. Scale-free networks generated by random walkers. Physica A 341, 80–86 (2004).

33. Takuma Tanaka, Toshio Aoyagi. Weighted scale-free networks with variable power-law exponents. Physica D 237, 898–907 (2008).

34. Riitta Toivonen, Jukka-Pekka Onnela, Jari Saramaki, Jorkki Hyvonen, Kimmo Kaski. A model for social networks. Physica A 371, 851–860 (2006).

35. Xu-Hua Yang, Shun-Li Lou, Guang Chen, Sheng-Yong Chen, Wei Huang. Scale-free networks via attaching to random neighbours. Physica A 392, 3531–3536 (2013).

36. Jennifer Lindquist, Junling Ma, P. van den Driessche, Frederick H. Willeboordse. Network evolution by different rewiring schemes. Physica D 238, 370–378 (2009).

37. Samuel Johnson, Joaquin J. Torres, Joaquin Marro. Nonlinear preferential rewiring in fixed-size networks as a diffusion process. Physical Review E 79, 1–3 (2009).

38. Chen Guang, Yang Xu-Hua, Xu Xin-Li. Weighted scaling in non-growth random networks. Communications in Theoretical Physics 58, 456–462 (2012).

39. Zhengping Fan, Guanrong Chen, Yunong Zhang. A comprehensive multi-local-world model for complex networks. Physics Letters A 373, 1601–1605 (2009).

40. Naoki Shibata, Yuya Kajikawa, Ichiro Sakata. Link prediction in citation networks. Journal of the American Society for Information Science and Technology 63, 78–85 (2012).

41. Linyuan Lu, Tao Zhou. Link prediction in complex networks: a survey. Physica A 390, 1150–1170 (2011).

42. Lada A. Adamic, Eyton Adar. Friends and neighbors on the web. Social Networks 25, 211–230 (2003).

43. David Liben-Nowell, Jon Kleinberg. The link-prediction problem for social networks.. Journal of the American Society for Information Science and Technology 58, 1019–1031 (2007).

44. Susanna Zaccarin, Giulia Rivellini. Modelling Network Data: An Introduction to Exponential Random Graph Models. 297–305 In Data Analysis and Classification. Springer Berlin Heidelberg, 2010.

45. Aaron Clauset, Christopher Moore, M. E. J. Newman. Hierarchical structure and the prediction of missing links in networks. Nature 453, 98–101 (2008).

46. Xue Zhang, Xiaojie Wang, Chengli Zhao, Dongyuan Yi, Zheng Xie. Degree-corrected stochastic block models and reliability in social networks. Physica A 393, 553–559 (2014).

47. Lars Backstrom, Jure Leskovec. Supervised random walks: predicting and recommending links in social networks. 635–644 In WSDM Conference. (2011).

48. Dashun Wang, Dino Pedreschi, Chaoming Song, Fosca Giannotti, Albert-Laszlo Barabási. Human Mobility, Social Ties, and Link Prediction. 1100–1108 In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2011. Link

49. C.A. Bliss, M.R. Frank, C.M. Danforth, P.S. Dodds. An evolutionary algorithm approach to link prediction in dynamic social networks. Journal of Computational Science 5, 750–764 (2014).

50. Peng Wang, BaoWen Xu, YuRong Wu, XiaoYu Zhou. Link prediction in social networks: the state-of-the-art. Science China 58 (2014).

51. Minh-Duc Luu, Ee-Peng Lim, Tuan-Anh Hoang, Freddy Chong Tat Chua. Modeling diffusion in social networks using network properties. 218–225 In Sixth International AAAI Conference on Weblogs and Social Media. (2012).

52. Romualdo Pastor-Satorras, Alessandro Vespignani. Epidemic spreading in scale-free networks. Physical Review Letters 86, 3200–3203 (2001).

53. Minkyoung Kim, David Newth, Peter Christen. Modeling dynamics of diffusion across heterogeneous social networks: news diffusion in social media. Entropy 15, 4215–4242 (2013).

54. Jure Leskovec, Lars Backstrom, Jon Kleinberg. Meme-tracking and the dynamics of the news cycle. 497–505 In 15th ACM SIGKDD international conference on Knowledge discovery and data mining. (2009).

55. Jacob Ratkiewicz, Santo Fortunato, Alessandro Flammini, Filippo Menczer, Alessandro Vespignani. Characterizing and modeling the dynamics of online popularity. Physical Review Letters 105, 1–4 (2010).

56. James P. Gleeson, Jonathan A. Ward, Kevin P. O’Sullivan, William T. Lee. Competition-induced criticality in a model of meme popularity. Physical Review Letters 112, 1–5 (2014).

57. Manfred Schroeder. Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise. W. H. Freeman and Company, 1991.

58. Hazhir Rahmandad, John Sterman. Heterogeneity and network structure in the dynamics of diffusion: comparing agent-based and differential equation models. Management Science 54, 998–1014 (2008).

59. Aditya Prakash, Deepayan Chakrabarti, Nicholas Valler, Michalis Faloutsos, Christos Faloutsos. Threshold conditions for arbitrary cascade models on arbitrary networks. Knowledge Information Systems 33, 549–575 (2012).

60. L. Weng, A. Flammini, A. Vespignani, F. Menczer. Competition among memes in a world with limited attention. Scientific Reports 2, 1–7 (2012).

61. Lilian Weng, Filippo Menczer, Yong-Yeol Ahn. Virality prediction and community structure in social networks. Scientific Reports 3, 1–6 (2013).

62. Leo Breiman. Random forests. Machine Learning 45, 5–32 (2001).

63. Vladimir Barash, Christopher Cameron, Michael Macy. Critical phenomena in complex contagions. Social Networks 34, 451–461 (2012).

64. Gabor Csardi, Tamas Nepusz. The igraph software package for complex network research. InterJournal Complex Systems, 1695 (2006). Link