
Daniel Stanley Tan edited untitled.tex about 8 years ago

Commit id: 8fe89f9ee2277787c6ee1313f7f437b6a4f42e03


I am particularly interested in pursuing further research on visualizing big data. Visualizing data helps reveal interesting patterns in large data sets that might not be obvious in other representations. It also aids domain experts in extracting information, generating ideas, and formulating hypotheses from the data. However, visualizing big, high-dimensional data is challenging because humans can only visualize up to three dimensions. A common way to handle this is through dimensionality reduction techniques such as Self-Organizing Maps (SOM) \cite{kohonen1990self}, Multidimensional Scaling (MDS) \cite{kruskal1964multidimensional}, and Principal Component Analysis (PCA) \cite{dunteman1989principal}, which map high-dimensional data into lower dimensions. This mapping inevitably loses information, but these algorithms are designed so that useful distances are preserved and information loss is minimized. The main problem is that their running time grows faster than linearly with the size of the data (classical MDS, for example, is at least quadratic in the number of points because it works on all pairwise distances), which makes them unsuitable for big data. Parallel implementations of SOM \cite{carpenter1987massively}, MDS \cite{varoneckas2015parallel}, and PCA \cite{andrecut2009parallel} exist, but parallelization only reduces the running time by a factor proportional to the number of processors, which may be sufficient for now but will not scale well as data sets continue to grow.

Clustering is another technique used in data mining. For big data, a clustering algorithm needs to run in at most quasilinear time. There are many clustering algorithms that can do this, such as BIRCH, DBSCAN, EM, and OPTICS, to name a few. BFR and CLIQUE seem promising for the task of big data visualization. The BFR (Bradley-Fayyad-Reina) algorithm is a variant of K-Means that can handle large data.
The idea is that if we assume the clusters to be normally distributed, then we can summarize each cluster by its mean and standard deviation, effectively reducing the number of data points to be processed in succeeding iterations. This notion of summarizing the data, and thereby reducing the number of points, may be applied to visualization to increase speed with minimal loss of information. CLIQUE, on the other hand, is a subspace clustering algorithm: it looks for clusters in subsets of the dimensions. This may be useful for reducing the number of dimensions and for revealing patterns that would otherwise be hidden by the inclusion of certain dimensions.
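
As a brief sketch of the summarization used by BFR (the symbols $N$, $\mathrm{SUM}_i$, and $\mathrm{SUMSQ}_i$ are notation introduced here for illustration), a cluster of $N$ points $x^{(1)}, \dots, x^{(N)}$ can be represented in each dimension $i$ by two running sums:
\[
\mathrm{SUM}_i = \sum_{j=1}^{N} x^{(j)}_i, \qquad
\mathrm{SUMSQ}_i = \sum_{j=1}^{N} \left( x^{(j)}_i \right)^2 .
\]
The mean and variance of the cluster in dimension $i$ are then recovered as
\[
\mu_i = \frac{\mathrm{SUM}_i}{N}, \qquad
\sigma_i^2 = \frac{\mathrm{SUMSQ}_i}{N} - \left( \frac{\mathrm{SUM}_i}{N} \right)^2 .
\]
Under this representation, a cluster in $d$ dimensions is stored as $2d + 1$ numbers regardless of how many points it contains, and two summaries can be merged simply by adding their corresponding components. This is the property that might carry over to visualization, since a summarized cluster could be drawn in place of its individual points.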