\section{Background of the Proposed Topic}

Visualizing data helps reveal interesting patterns that might not be obvious in other representations. It also aids domain experts in extracting information, generating ideas, and formulating hypotheses from the data, which is why data visualization plays a huge role in the data analytics process. However, visualizing high dimensional data is challenging due to the human limitation of only being able to perceive up to three dimensions. Moreover, traditional techniques are incapable of visualizing huge amounts of data because their processing time grows superlinearly with the number of data points. This poses a problem because the data being generated in the world continues to grow exponentially; in fact, data generated in the past decade is much larger than all data collected in the past century combined \cite{data2013}. For now, no algorithm exists that tackles all the problems of handling big data, although there have been many works that address specific aspects of it \cite{xu2016exploring}.

Some existing ways to visualize high-dimensional data are dimensionality reduction techniques like Random Projections \cite{bingham2001random,kaski1998dimensionality}, Self-Organizing Maps (SOM) \cite{kohonen1990self}, Multidimensional Scaling (MDS) \cite{kruskal1964multidimensional}, and Principal Components Analysis (PCA) \cite{dunteman1989principal}, which map high dimensional data into lower dimensions; for visualization, the data must be reduced to at most three dimensions. Each algorithm reduces dimensions while preserving a different property: local neighborhood relations for SOM, inter-point distances for MDS, and data variance for PCA. The mapping inevitably loses information, but these algorithms are designed so that useful distances are preserved and information loss is minimized. The problem is that their time complexity is superlinear in the number of points (classical MDS, for instance, operates on the full pairwise distance matrix, which is already quadratic), making them unsuitable for big data. Parallelizable implementations of SOM \cite{carpenter1987massively}, MDS \cite{varoneckas2015parallel}, and PCA \cite{andrecut2009parallel} exist, but parallelization only reduces the running time by a linear factor (the number of processors), which may be good for now but will not scale as datasets keep growing. In processing big data, algorithms need to run in at most quasilinear time.
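As a minimal illustration of such mappings (not a prescription of this proposal's method), the following Python sketch uses the scikit-learn library to project a toy 20-dimensional dataset down to two dimensions with each of PCA, MDS, and a Gaussian random projection; the variable names and the toy data are ours:

\begin{verbatim}
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(0)
X = rng.rand(500, 20)   # toy data: 500 points in 20 dimensions

# Each method maps the data to 2-D while preserving a different property:
X_pca = PCA(n_components=2).fit_transform(X)   # maximizes retained variance
X_mds = MDS(n_components=2,                    # preserves inter-point
            random_state=0).fit_transform(X)   # distances (stress)
X_rp = GaussianRandomProjection(
    n_components=2, random_state=0).fit_transform(X)  # random linear map
\end{verbatim}

Note that MDS must form the full $n \times n$ distance matrix, so even this small example hints at the quadratic cost discussed above.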
Clustering is another technique used in data mining, and for big data the clustering algorithm likewise needs to run in at most quasilinear time. Many clustering algorithms can do this, such as BIRCH \cite{zhang1996birch}, FCM \cite{bezdek1984fcm}, DBSCAN \cite{ester1996density}, EM \cite{dempster1977maximum}, and OPTICS \cite{ankerst1999optics}, to name a few. BFR (Bradley-Fayyad-Reina) \cite{bradley1998scaling} and CLIQUE \cite{agrawal1998automatic} seem especially promising for the task of big data visualization. BFR is a variant of K-Means that can handle large data: if we assume the clusters to be normally distributed, then each cluster can be summarized by its mean and standard deviation, effectively reducing the number of data points to be processed in the succeeding iterations (the summary statistics are made explicit below). This notion of summarizing data points may be applied to visualization to increase speed with minimal loss of information. CLIQUE, on the other hand, is a subspace clustering algorithm: it looks for clusters in subsets of the dimensions (a sketch of its grid-based search is given below). This may be useful both in reducing the number of dimensions and in revealing patterns that would otherwise be hidden by the inclusion of irrelevant dimensions.
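To make BFR's summarization step concrete: in the usual presentation of the algorithm, a cluster of $N$ points is kept only as the triple $(N, \mathrm{SUM}, \mathrm{SUMSQ})$, where $\mathrm{SUM}_i$ and $\mathrm{SUMSQ}_i$ are the sum and the sum of squares of the points' $i$-th coordinates. The per-dimension mean and variance are then recovered as
\[
\mu_i = \frac{\mathrm{SUM}_i}{N}, \qquad
\sigma_i^2 = \frac{\mathrm{SUMSQ}_i}{N} - \left(\frac{\mathrm{SUM}_i}{N}\right)^2 .
\]
Because these statistics are additive, absorbing new points or merging two summaries requires only component-wise addition, which is what allows BFR to process data that does not fit in memory in a single pass.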
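The grid-based search at the core of CLIQUE can be sketched as follows. This is a toy illustration, not the full algorithm: \texttt{dense\_units} is a hypothetical helper of ours, the parameters \texttt{xi} (intervals per dimension) and \texttt{tau} (density threshold) follow the paper's $\xi$ and $\tau$ notation, and the real algorithm prunes candidate subspaces bottom-up rather than enumerating them:

\begin{verbatim}
import numpy as np
from itertools import combinations
from collections import Counter

def dense_units(X, xi=10, tau=0.05, max_dim=2):
    # Partition each dimension into xi equal intervals and report, for
    # every subspace of at most max_dim dimensions, the grid cells that
    # hold at least tau * n points.  CLIQUE proper generates k-dim
    # candidates only from dense (k-1)-dim units (Apriori-style);
    # this sketch enumerates all subspaces exhaustively for clarity.
    n, d = X.shape
    mins, maxs = X.min(axis=0), X.max(axis=0)
    grid = ((X - mins) / (maxs - mins + 1e-12) * xi).astype(int)
    dense = {}
    for k in range(1, max_dim + 1):
        for dims in combinations(range(d), k):
            counts = Counter(tuple(row) for row in grid[:, list(dims)])
            cells = [c for c, cnt in counts.items() if cnt >= tau * n]
            if cells:
                dense[dims] = cells
    return dense
\end{verbatim}

Clusters are then formed by connecting adjacent dense cells within each subspace; reporting clusters per subspace is what lets CLIQUE surface structure that full-dimensional methods would miss.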