_Keywords:_ meme, word2vec, clustering, Twitter

INTRODUCTION

LITERATURE REVIEW

There are two key sets of techniques for clustering tweets proposed in the literature. The first involves extracting ‘protomemes’ from the tweets, which are atomic collections of hashtags, user mentions, URLs and phrases. The method constructs projections of the protomemes to the tweet, user, content and network levels. Protomeme pair similarity matrices are constructed for each projection, and these are then aggregated into a single similarity matrix for clustering. The resulting protomeme clusters are assumed to represent memes present in the data. This method has since been extended to handle large amounts of social media data in a streaming fashion, and to identify new protomeme clusters emerging as new data arrives.

The second set of techniques involves neural network based embedding models, following on from the word2vec model. Several variations of these methods have been developed. For instance, word2vec embedding models have been applied to classify whether tweets were related to the 2015 Venezuelan election. That study showed that a word embedding feature set did not provide a lift in the F1 score over a TF-IDF benchmark feature set when using a support vector machine model; however, improved performance has been demonstrated by pairing the word embedding feature set with a convolutional neural network. Further developments in the word embedding family of models include paragraph2vec and character level representations. A recent study coined the method ‘tweet2vec’, which produces vector space embeddings for whole tweets by modelling character sequences. However, while word2vec and paragraph2vec are both unsupervised techniques, tweet2vec is trained in a supervised manner to predict the hashtags in a tweet, and thus assumes that hashtags alone are a good representation of a tweet. Hierarchical clustering has been applied to tweet2vec embeddings trained on the SNOW 2014 data set, with results generally outperforming the state of the art.

METHOD

In this paper, we present a new method to find semantically and topically similar clusters of tweets. This method extends techniques published recently in the literature, and we demonstrate an improvement in performance.

Evaluation Data

We evaluate our method, along with a number of benchmark methods, on two Twitter data sets collected through Twitter’s public real time API. The first data set consists of tweets extracted from six different search streams, with the tweets for each stream taken from a particular date. We take the six sets of tweets as ground truth labels and evaluate the clustering methods against these labels. We have removed retweets and randomly sampled 2,000 tweets from each stream. Details of the search terms and dates for each stream are given in Table [tab:1]. Note that providing a term to the Twitter search API returns all tweets containing that term, including within hashtags and user screen names (e.g. trump retrieves #trump, @realDonaldTrump and Trump). In this data set, we use the original search term as the ground truth label for each tweet. Since the presence of the search term in the text would bias some algorithms over others, we remove the respective term before running our approach.

  SEARCH TERM    DESCRIPTION                                                          DATE
  -------------- -------------------------------------------------------------------- -------------
  auspol         Used for Australian opinion polling                                   26 Jul 2017
  qanda          Used for a panel discussion TV show in Australia                      11 Dec 2017
  trump          Tweets associated with Donald Trump                                   24 Nov 2017
  btc            Tweets associated with bitcoin                                        23 Jan 2018
  castro         Tweets associated with Fidel Castro, after his death                  27 Nov 2016
  sydneytrains   Tweets associated with Sydney Trains, after they announced a strike   25 Jan 2018

  : Search terms, descriptions and collection dates for the six streams data

The second data set is centred around the popular ‘QandA’ program run by the Australian Broadcasting Corporation. This data set was extracted from the Twitter real time API on the 4th of December, 2017, by filtering for the search term ‘qanda’ during the episode’s 70 minute time slot. Again, we have removed all retweets from the data. To provide a ground truth label, 1,418 tweets from the full set of 7,137 were manually labelled into 23 distinct topics. This manual labelling was performed in reference to the seven questions asked of the panel by the audience.

Data Preprocessing

To prepare our data for the different clustering methods, we first remove the search term for the respective set of tweets so that it is not used in the clustering. Stop words and punctuation are then removed, and we tokenise the tweets on whitespace, ensuring that hashtags and user mentions are kept as single tokens. URLs are removed during tokenisation.
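The following is a minimal sketch of this preprocessing step. It is illustrative only: the token pattern, the inline stop word list and the function names are our assumptions rather than the exact implementation.

```python
import re

# a stand-in stop word list; a full implementation would use a standard list
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "on", "is", "are", "for"}

URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[#@]?\w+")  # keeps #hashtags and @mentions as single tokens

def preprocess(text, search_term):
    """Tokenise a tweet: strip URLs, the stream's search term,
    stop words and punctuation."""
    text = URL_RE.sub(" ", text.lower())
    return [
        t for t in TOKEN_RE.findall(text)
        if t.lstrip("#@") != search_term  # remove the biasing search term
        and t not in STOP_WORDS
    ]

print(preprocess("#Trump wins the debate https://t.co/xyz", "trump"))
# -> ['wins', 'debate']
```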
Protomeme Representations

We start with an implementation of the protomeme clustering method described in the literature review to form a baseline. This method incorporates the key atomic entities used on Twitter to denote topics and content groups, namely hashtags, URLs, user mentions and phrases. Furthermore, projecting the protomemes to the user, tweet, content, network and phrase levels captures both the content and network aspects of Twitter, which other methods can lack. In line with the original method, we extract all hashtags, user mentions and URLs from the text as protomemes. However, while the original method includes the phrase of the tweet as a protomeme (i.e. all textual content once hashtags, user mentions and URLs have been removed, and after stemming), we instead use frequently occurring bigrams as the textual content protomeme. Our rationale for this change is that a full phrase is unlikely to be repeated many times through the data set, whereas some bigrams are likely to be frequent. Furthermore, bigrams often indicate key people or other nouns, and so can capture similar information to hashtags.
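A minimal sketch of this extraction step, operating on the token lists produced by the preprocessing above. The frequency threshold for a bigram to qualify as a protomeme is our assumption, and URL protomemes (which are stripped during tokenisation) would be collected from the raw text beforehand.

```python
from collections import Counter

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def extract_protomemes(tokenised_tweets, min_bigram_count=5):
    """Map each tweet index to its protomemes: hashtags, user mentions
    and frequently occurring bigrams."""
    # first pass: count bigrams across the whole corpus
    bigram_counts = Counter(bg for toks in tokenised_tweets for bg in bigrams(toks))
    frequent = {bg for bg, n in bigram_counts.items() if n >= min_bigram_count}

    protomemes = {}
    for i, toks in enumerate(tokenised_tweets):
        pms = {t for t in toks if t.startswith(("#", "@"))}    # hashtags, mentions
        pms |= {bg for bg in bigrams(toks) if bg in frequent}  # frequent bigrams
        protomemes[i] = pms
    return protomemes
```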
Once we have the protomemes extracted, we construct the protomeme projection matrices. The tweet and user projections are constructed as co-occurrence matrices between the protomemes and the tweets or users. The content projection is defined in the original method as a normalised term frequency matrix over the group of tweets associated with each protomeme; in this paper, we also experiment with a paragraph2vec alternative for this projection, calculated as the mean paragraph2vec representation of all tweets in which the protomeme occurs. The network projection is defined as the protomeme co-occurrence with respect to the list of all users, retweeting users and user mentions for tweets which contain the protomeme. Finally, the bigram projection is calculated as the bigram to protomeme co-occurrence matrix.

The next step is to create similarity matrices for all protomeme pairs across the projection matrices. In the original method these are aggregated using a variety of strategies, with the authors noting that taking the maximum similarity for each protomeme pair across the projections performs well. Since we cluster on the aggregated matrix directly, we modify this step by instead calculating a cosine distance matrix over the protomemes for each projection, and then taking the pairwise minimum across the projections (a sketch is given at the end of this subsection).

Once we have the protomeme clusters (see Section [clustering] for details on the clustering approach), we need to map them back to the original tweets for evaluation. The relationship between tweets and protomemes is many to many, and one tweet can have multiple protomemes. The original method does not resolve its clustering to the tweet level, instead allowing tweets to belong to multiple clusters. Our methods require unique tweet clusters, so we address the issue by calculating the centroid of each protomeme cluster with respect to each of the protomeme projection matrices. We then calculate the minimum distance between each protomeme and each cluster centroid for each projection, and take the mean squared distance over the projections. This gives a matrix of mean squared distances between protomemes and clusters. To map this matrix to the tweet level, we simply take its mean over all the protomemes in the tweet.
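A minimal sketch of the modified aggregation step, assuming each projection has already been assembled as a protomeme-by-feature matrix with a shared protomeme ordering; the projection names in the usage line are placeholders.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

def aggregate_projections(projections):
    """Given a list of (n_protomemes x n_features) projection matrices,
    compute a cosine distance matrix for each projection and take the
    element-wise minimum across the projections."""
    aggregated = None
    for P in projections:
        D = cosine_distances(P)  # (n_protomemes x n_protomemes)
        aggregated = D if aggregated is None else np.minimum(aggregated, D)
    return aggregated  # a precomputed distance matrix, ready for clustering

# usage (placeholder names):
# dist = aggregate_projections([tweet_proj, user_proj, content_proj,
#                               network_proj, bigram_proj])
```

Since cosine distance is one minus cosine similarity, the element-wise minimum distance is equivalent to the maximum similarity aggregation of the original method, but it yields a matrix that can be passed directly to a distance-based clustering algorithm.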
Paragraph2vec Embedding

One contribution of this paper is to use a pre-trained paragraph2vec representation of the tweets as a component of the clustering matrix. There are several considerations when training these models, such as how many dimensions to use, the context window length, the type of training algorithm, and the size of the training corpus. To explore the effect of these hyperparameters, we test multiple paragraph2vec models with dimensions (d) of 100, 300 and 1000, and context window lengths of 3, 6 and 9. We also compare pre-training the model over a large historical data set of related tweets against training only on the clustering sample. Recent literature has provided evidence that a larger pre-trained model can deliver vector representations that result in higher accuracy in classification tasks. Our larger historical data set consists of the same Twitter search queries over a much longer period of time, totalling over 3 million tweets _[I need to extract this data and train, also yet to set up all the hyperparameter tests]_.

Composite Approach

While paragraph2vec representations can provide better results in some tasks, we have found that clustering tweets with the embeddings as features often does not separate the tweets into clear clusters. One likely reason is the short and noisy nature of Twitter text; another is that some tokens, such as hashtags, user mentions and URLs, are very unbalanced in terms of the information they carry. One approach to address this issue is to effectively up-weight the terms that carry more information. Many of these terms correspond to protomemes, while others may appear among the top weighted terms of a term frequency-inverse document frequency (TF-IDF) matrix. To investigate this, we introduce two new clustering approaches based on composite observation matrices. Both composite matrices start from the d-dimensional paragraph2vec representations, with another matrix concatenated: the first composite approach concatenates the tweet level mean of the protomeme to cluster centroid distance matrix, and the second concatenates a top n term TF-IDF matrix. Our hypothesis is that including these additional matrices, which capture protomeme clusters or high TF-IDF terms, will up-weight those terms as required to produce a better clustering.

Clustering Framework

Given that the methods we investigate yield quite different clustering matrices, we experiment with three clustering methods: K-means clustering, hierarchical (agglomerative) clustering, and DBSCAN, all implemented with Python’s scikit-learn library. Both K-means and hierarchical clustering require the number of clusters to be specified. DBSCAN has been included since it is known to find clusters of arbitrary geometric shape; given that a character level embedding feature set only provided classification uplift when a CNN model was used, the clustering performance of DBSCAN on these feature sets is worth exploring. In the case of our data, we expect the number of clusters to equal the number of search queries. For comparison across all of our methods, we only include tweets that have protomemes present. _[I’ve only run hierarchical so far]_ For the protomeme clustering, we pass the precomputed distance matrix with the ‘complete’ linkage method to the _AgglomerativeClustering_ class. All other hierarchical clustering is run on the observation matrices with the default parameters.

Evaluation

We compare the results of these clustering methods on the search query data set, with the search query label as ground truth, and on the QandA data set, with the manually annotated labels as ground truth. We use standard metrics for the extrinsic evaluation of a clustering method: homogeneity, completeness and normalised mutual information (NMI). Homogeneity scores the purity of the clusters, with higher scores for clusters dominated by a single class. Completeness, conversely, measures the extent to which members of the same class are assigned to the same cluster. The NMI (also known as the V-measure) is the harmonic mean of homogeneity and completeness, and is designed to balance both measures. We also examine the ground truth to cluster confusion matrices.
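The sketch below ties these pieces together for the paragraph2vec / TF-IDF composite: train a paragraph2vec model (here gensim’s Doc2Vec, trained on the clustering sample only), concatenate the embeddings with a top n term TF-IDF matrix, cluster hierarchically, and score against the ground truth labels. The hyperparameter values shown are a single point from the grid described above, and the variable names and library choices are our assumptions.

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

def composite_cluster(tokenised_tweets, labels, n_clusters=6, d=300, window=6, top_n=1000):
    # paragraph2vec embeddings of each tweet
    docs = [TaggedDocument(toks, [i]) for i, toks in enumerate(tokenised_tweets)]
    p2v = Doc2Vec(docs, vector_size=d, window=window, min_count=2, epochs=20)
    emb = np.vstack([p2v.dv[i] for i in range(len(docs))])

    # top n term TF-IDF matrix (tweets are already tokenised)
    tfidf = TfidfVectorizer(max_features=top_n, preprocessor=lambda x: x,
                            tokenizer=lambda x: x, token_pattern=None)
    X_tfidf = tfidf.fit_transform(tokenised_tweets).toarray()

    # composite observation matrix: embeddings with TF-IDF concatenated
    X = np.hstack([emb, X_tfidf])

    pred = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)
    return (homogeneity_score(labels, pred),
            completeness_score(labels, pred),
            v_measure_score(labels, pred))
```

The protomeme baseline instead passes the precomputed aggregated distance matrix from the earlier sketch to _AgglomerativeClustering_ with ‘complete’ linkage, as described above.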
RESULTS

Table [tab:results] shows the evaluation metrics for the hierarchical clustering on the six search streams data set. Across all three measures, the combined paragraph2vec and TF-IDF method achieves the best performance.

  METHOD                    HOMOGENEITY   COMPLETENESS   NMI
  ------------------------- ------------- -------------- --------
  TF-IDF                    0.1959        0.4795         0.2782
  Protomeme                 0.2574        0.5875         0.3580
  paragraph2vec             0.3482        0.3569         0.3525
  paragraph2vec/protomeme   0.3481        0.3567         0.3523
  paragraph2vec/TF-IDF      0.4559        0.5041         0.4788

  : Clustering evaluation on the six streams data

Table [tab:results_search_conf] depicts the confusion matrix of clusters against search queries. It is clear that cluster 1 picks up both auspol and qanda, which are related and often used in the same tweet. Clusters 2, 4 and 5 all appear to capture the btc query. Cluster 6 is clearly related to castro, and cluster 3 refers to trump; however, trump often appears in tweets alongside #auspol or #qanda, so it features in cluster 1 as well.

  SEARCH QUERY        1      2      3      4      5      6
  -------------- ------ ------ ------ ------ ------ ------
  auspol          1,110     40    288      0      1      0
  btc                48    278     78    729    192      0
  castro             96     11     85      0      0    660
  qanda           1,508     67     87      0      7      0
  trump             215     35    342      0      0      0

  : Clustering confusion matrix on the search queries data

Table [tab:results_qanda] shows the evaluation metrics for the hierarchical clustering on the qanda data set. Again, the combined paragraph2vec and TF-IDF method achieves the best performance across all three measures. Of particular interest is that the paragraph2vec method on its own performs poorly. This may be because many of the topics in this data set are strongly correlated with particular terms, hashtags and user mentions.

  METHOD                    HOMOGENEITY   COMPLETENESS   NMI
  ------------------------- ------------- -------------- --------
  TF-IDF                    0.3031        0.3480         0.3240
  Protomeme                 0.0898        0.4081         0.1473
  paragraph2vec             0.0688        0.0655         0.0671
  paragraph2vec/protomeme   0.0688        0.0655         0.0671
  paragraph2vec/TF-IDF      0.4194        0.3837         0.4007

  : Clustering evaluation on the qanda data

CONCLUSION