Modeling multiple hashtag time series

Nov 8 - Nov 15

Principal Components

Suppose we now have \(N\) hashtag point processes. The analysis presented here is based on their daily count processes.

We denote the daily count for hashtag \(i\) by \(x_i(t)\), where \(t\) takes integer values in \([t_{\min}, t_{\max}]\). Our data cover a year-long time window, the calendar year 2012.
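As a minimal sketch of how a daily count series \(x_i(t)\) could be derived from one hashtag's point process, the snippet below bins event timestamps by calendar day. The function name `daily_counts`, the ISO-8601 timestamp strings, and the example events are assumptions made for illustration; the text does not say how the counts were actually produced.

```python
import numpy as np

def daily_counts(timestamps, t_min, t_max):
    """Bin event timestamps into daily counts over [t_min, t_max], inclusive.

    `timestamps` is a sequence of ISO-8601 strings with second resolution
    (an assumed input format); `t_min` and `t_max` are dates such as
    "2012-01-01".
    """
    # Parse at second resolution, then truncate each event to its day.
    events = np.array(timestamps, dtype="datetime64[s]").astype("datetime64[D]")
    start = np.datetime64(t_min, "D")
    end = np.datetime64(t_max, "D")
    n_days = int((end - start).astype(int)) + 1
    # Day index of each event relative to t_min.
    idx = (events - start).astype(int)
    # Discard events that fall outside the window.
    idx = idx[(idx >= 0) & (idx < n_days)]
    return np.bincount(idx, minlength=n_days)

# Hypothetical events for a single hashtag; the last one lies outside 2012.
ts = [
    "2012-01-01T00:05:00",
    "2012-01-01T23:59:59",
    "2012-10-12T08:30:00",
    "2013-01-01T00:00:00",
]
x = daily_counts(ts, "2012-01-01", "2012-12-31")
```

Stacking one such vector per hashtag as columns gives a days-by-hashtags array in which all series share the same time axis.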

Below, we show the daily counts of a few top-trending hashtags on a common time axis.

The daily count series for 10 top-trending hashtags: “bullying”, “bully”, “stopbullying”, “bullymovie”, “spiritday”, “ripamandatodd”, “lgbt”, “bullied”, “teamfollowback”, “ripboybeliebermartin”.

The series show large single spikes that are (1) highly localized in time and (2) shared by more than one hashtag. In the following, we use PCA to identify these singleton spikes.

Let \(X = [x_1, x_2, \cdots, x_N]\) be a \(T \times N\) matrix, where \(T = t_{\max} - t_{\min} + 1\) is the number of days, each row corresponds to a day, and each column corresponds to a hashtag. We performed standard PCA (centered, unscaled) on the top \(N=200\) hashtags. Treating days as “observations” and hashtags as covariates, we obtained principal components \(z_i \in \mathbb{R}^T\) for \(i=1,\cdots, N\). The variance explained declined rapidly, with the top 4 principal components dominating the total variance.

Residual variance after retaining the top-\(k\) principal components.
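The PCA step described above can be sketched with a singular value decomposition of the column-centered matrix. This is an illustrative sketch, not the code used for the analysis: the helper `pca_centered` is a hypothetical name, and the matrix below is synthetic (Poisson noise plus one artificial spike shared by five columns) rather than the hashtag data.

```python
import numpy as np

def pca_centered(X):
    """Centered, unscaled PCA of a T x N matrix (rows = days, columns = hashtags)."""
    X = np.asarray(X, dtype=float)
    # Center each hashtag column; no rescaling to unit variance.
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                          # column i is z_i, a vector in R^T
    explained = s**2 / np.sum(s**2)         # fraction of variance per component
    residual = 1.0 - np.cumsum(explained)   # variance left after the top k
    return scores, Vt, explained, residual

# Synthetic stand-in for the daily count matrix (assumed, not the real data).
rng = np.random.default_rng(0)
T, N = 366, 20
X = rng.poisson(5.0, size=(T, N)).astype(float)
X[100, :5] += 500.0  # one spike on day 100, shared by the first five columns

scores, loadings, explained, residual = pca_centered(X)

# The day with the largest absolute score on the first component,
# and the columns with the largest absolute loadings on it.
spike_day = int(np.argmax(np.abs(scores[:, 0])))
shared = sorted(np.argsort(-np.abs(loadings[0]))[:5].tolist())
```

In this synthetic setting the first component isolates the shared spike: its score peaks on the spike day and its largest loadings fall on the columns that share the spike. The `residual` array corresponds to the quantity plotted against \(k\) in the residual variance figure.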