ROUGH DRAFT authorea.com/101740
Main Data History
Export
Show Index Toggle 0 comments
  •  Quick Edit
  • Draft

    Background

    Visualizing data help reveal interesting patterns from the data that might not be obvious in some representations. It also aids domain experts in extracting information, generating ideas, and formulating hypotheses from the data, which is why data visualization plays a huge role in the data analytics process. However, visualizing high dimensional data is challenging due to the human limitation of only being able to visualize up to three dimensions. Moreover, traditional techniques are also incapable of visualizing huge amounts of data due to their long processing time which increases exponentially as the number of data points increases. This poses a problem because the data being generated in the world is rapidly growing. In fact, data generated in the past decade is much larger than all the combined data collected in the past century (Dragland 2013). For now, no algorithm exists that tackles all the problems of handling big data, although there has been many works that address some specific aspects of it (Xu 2016).

    Some existing ways for tackling high-dimensional data are through dimensionality reduction techniques. These would include Random Projections (Bingham 2001, Kaski 1998), Multidimensional Scaling (MDS) (Kruskal 1964) and Principal Components Analysis (PCA) (Dunteman 1989). These algorithms significantly reduce the number of dimensions by mapping the high dimensional data into lower dimensions. This mapping inevitably lose information but these algorithms are creative in doing this in such a way that useful distances are preserved and information loss is minimized.

    For data visualization, the number of dimensions have to be reduced to at most three dimensions. The most commonly used dimensionality reduction techniques for visualizing high dimensional data are Self Organizing Maps (SOM) (Kohonen 1990), Multidimensional Scaling (MDS) (Kruskal 1964) and Principal Components Analysis (PCA) (Dunteman 1989). All three algorithms reduce dimensions based on certain properties such as local neighborhood relations for SOM, inter-point distances for MDS, and data variance for PCA. The only problem is that the time complexity of these algorithms are exponential. This is not suitable for handling big data. Parallelizable implementations of SOM (Carpenter 1987), MDS (Varoneckas 2015) and PCA (Andrecut 2009) exist but this only reduces the complexity by a linear factor, which may be good for now but will not scale well for even larger datasets.

    Clustering is a technique used in data mining that is useful in handling big data. It groups the data points together into clusters in such a way that the data points within a cluster are similar to each other, while the maximizing the dissimilarity across clusters. Clustering algorithms need to run in at most quasilinear time to be efficient for big data. There are many clustering algorithms that can do this such as BIRCH (Zhang 1996), FCM (Bezdek 1984), DBSCAN (Ester 1996), EM (Dempster 1977), and OPTICS (Ankerst 1999) to name a few.

    BFR (Bradley-Fayyad-Reina) (Bradley 1998) and CLIQUE (Agrawal 1998) seems promising for the task of big data visualization. BFR (Bradley-Fayyad-Reina) algorithm is a variant of k-Means that can handle large data. The idea is if we assume the clusters to be normally distributed then we can summarize the clusters by keeping only the mean and standard deviations of the clusters and throwing away all the specific samples in the cluster. Effectively, this reduce the number of data points to be processed in the succeeding iterations. CLIQUE on the other hand is a subspace clustering algorithm, it looks for clusters in subsets of the dimensions and may be useful in reducing the number of dimensions and also in revealing patterns that may be hidden due to other dimensions that may not be relevant.

    Proposed Topic for Research

    \label{fig:flow}Envisioned flow of the interactive visualization tool.