Heuristically, we set a decision threshold to distinguish between motifs and discords. This threshold can be based on the frequency count of each word pattern as a percentage of the total number of observations, and it can be tuned to yield a manageable number of discord candidates for further analysis. More details on setting this threshold are discussed in the applied case studies.

In the two-week example, this process yields two patterns with a frequency greater than one, which are therefore the motif candidates. A manual review of the data confirms that these patterns match the normally expected profiles for a typical weekday (\(acca\)) and weekend day (\(aaaa\)). The less frequent patterns are tagged as discords and can be analyzed in more detail. In this case it can be determined that the patterns \(abba\), \(abca\), and \(acba\), despite being infrequent, are not abnormal given the occupancy schedule for those particular days. Pattern \(ccba\), however, is not explainable by the schedule and is due to a fault causing excessive consumption in the early morning hours.
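As a minimal sketch, the frequency-based split can be implemented by counting SAX words and thresholding the counts. The day ordering below is hypothetical; only the pattern counts follow the two-week example, and an absolute count threshold of one is assumed rather than a percentage:

```python
from collections import Counter

# One SAX word per day for the two-week example; the exact sequence of
# days is an assumption for illustration, the counts match the text.
words = ["acca", "acca", "acca", "acca", "aaaa", "aaaa",
         "acca", "abba", "abca", "acba", "acca", "ccba",
         "aaaa", "aaaa"]

counts = Counter(words)
threshold = 1  # tune to yield a manageable number of discord candidates

# Patterns above the threshold are motif candidates; the rest are discords.
motifs = {w for w, c in counts.items() if c > threshold}
discords = {w for w, c in counts.items() if c <= threshold}

print(sorted(motifs))    # frequent patterns -> motif candidates
print(sorted(discords))  # infrequent patterns -> discord candidates
```

The same split works with a percentage threshold by dividing each count by `len(words)` before comparing.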

This step leads into the next phase of the process, which focuses on further aggregating the motif candidates of the dataset. The number of motif candidates retained in this step indicates how many clusters are likely to pick up meaningful structure from the dataset.

Clustering

After dividing the profiles into motif and discord candidates, we cluster the motif candidates to create general daily performance phenotypes of the targeted data stream. This step is supplementary: it is useful when the SAX transformation produces too many motif candidates for the chosen input parameter settings. Clustering would be helpful, for example, if 15 motif candidates are created and the user wants to further aggregate them into 4 or 5 typical profiles for simulation calibration purposes. This feature gives the user additional control over the aggregation of the performance characterization, which can be useful when choosing large values of \(A\) or \(W\) in the SAX process. It should be noted that for simplified cases or small datasets, this step may be redundant with the SAX aggregation.

We use k-means to cluster the daily profiles after removing the discord candidate day-types. This ensures the load profile patterns are not influenced by the less frequent discords. Time series clustering can be approached as a raw-data-based, feature-based, or model-based problem \cite{WarrenLiao:2005bq}. Numerous clustering techniques have been developed and evaluated for various contexts and optimization goals. The most common implementation is the k-means algorithm, and we chose to use it with the Euclidean distance measure due to its simplicity and demonstrated appropriateness for this application \cite{Iglesias:2013ja,MacQueen:1967uv}. In our application, the algorithm takes the daily chunks \((N_1, N_2, \ldots, N_n)\) and partitions these observations into \(k\) sets, \(S = \{S_1, S_2, \ldots, S_k\}\), so as to minimize the within-cluster sum of squares \cite{Rokach:2005ti}:

\[{\arg\!\min_{S}}\sum_{i=1}^{k}\sum_{N_j\in S_i} \parallel N_j - \mu_i \parallel ^2 \label{eq:kmeans}\]

where \(\mu_i\) is the mean of the points in \(S_i\).
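A minimal sketch of this clustering step, assuming a scikit-learn implementation and synthetic stand-in profiles (the source does not specify an implementation or this data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy stand-in for the motif-candidate daily profiles: n days x 24 hourly
# readings. In practice these are measured daily chunks, not random data.
flat = rng.normal(0.3, 0.05, size=(20, 24))     # low, flat "weekend-like" days
peaked = rng.normal(0.3, 0.05, size=(20, 24))
peaked[:, 8:18] += 0.5                          # occupied-hours "weekday-like" peak
profiles = np.vstack([flat, peaked])

k = 2  # number of clusters; must be specified by the user
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(profiles)

# km.labels_ assigns each day to a set S_i; km.cluster_centers_ are the mu_i.
# km.inertia_ is the within-cluster sum of squares being minimized.
print(km.labels_)
print(km.inertia_)
```

Each cluster center is then a candidate typical daily profile for the data stream.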

The disadvantages of k-means are the need to specify the number of clusters in advance and the sensitivity to the selection of the initial partition. Neither of these faults is obviously detrimental to this particular application, and testing alternative clustering algorithms is outside the scope of this work.

Each clustering step includes the calculation of two internal validation metrics that statistically evaluate how well the k-means algorithm was able to create distinct groups of daily profiles: the silhouette coefficient and the sum of squared errors. The silhouette coefficient measures intra-cluster cohesion and inter-cluster separation; a score of 1 is best and -1 is worst. The coefficient is calculated with the following equation \cite{Rousseeuw:1987wr}:

\[s = \frac{b - a}{\max{(a,b)}} \label{eq:silhouette_eq}\]

where \(a\) is the mean distance between a sample and all other points in the same cluster, and \(b\) is the mean distance between a sample and all points in the next nearest cluster. The sum of squared errors (SSE) is another metric that indicates the tightness of the clusters, with a smaller error being more desirable. It is calculated using the following equation \cite{Rokach:2005ti}:

\[SSE = \sum_{i=1}^{k}\sum_{N_j\in S_i} \parallel N_j - \mu_i \parallel ^2 \label{eq:sumofsquare_eq}\]

where the variables are the same as those from Equation \ref{eq:kmeans}.
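Both validation metrics can be computed alongside the clustering itself. The sketch below assumes a scikit-learn implementation and synthetic, well-separated profiles purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)

# Two well-separated synthetic groups of 24-hour daily profiles
# (stand-ins for real motif-candidate days).
a = rng.normal(0.0, 0.05, size=(15, 24))
b = rng.normal(1.0, 0.05, size=(15, 24))
profiles = np.vstack([a, b])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)

sil = silhouette_score(profiles, km.labels_)  # in [-1, 1], higher is better
sse = km.inertia_                             # sum of squared errors, lower is better

print(sil, sse)
```

Repeating this over a range of cluster counts and comparing the two metrics is one common way to choose \(k\).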

Expressive visualization for interpretation

As the final step, interpretation and visualization are important for DayFilter so that a human analyst can visually extract knowledge from the results and make decisions regarding further analysis. We draw on the overview, zoom-and-filter, details-on-demand approach \cite{Shneiderman:1996jt} and the previously mentioned VizTree tool \cite{Lin:2004wv}. The hidden structures of building performance data are revealed through the clustering process, and we use visualization to communicate this structure to an analyst. The process uses a modified Sankey diagram to visualize the augmented suffix tree in a way in which the frequency count of each SAX word can be distinguished. Figure \ref{fig:saxdiscordsankeyheatmap} shows how this visualization is combined with a heatmap of the daily profiles associated with each SAX word, using the same two-week example data as Figures \ref{fig:saxcreation} and \ref{fig:saxdiscordsankey}. The Sankey diagram is rearranged according to the frequency threshold set to distinguish between the motif and discord candidates.