Daniel Stanley Tan edited untitled.tex about 8 years ago (commit 383a124f021c0f771f7e4f1f2f9d3a947de83788).
Visualizing data helps reveal interesting patterns in large data sets that might not be obvious in other representations. It also aids domain experts in extracting information, generating ideas, and formulating hypotheses from huge amounts of data. However, visualizing big, high-dimensional data is challenging, due both to processing time that increases rapidly with the amount of data and to the human limitation of only being able to visualize up to three dimensions. Indeed, in recent years there has been an explosion of data, and it continues to grow by the second. In fact, the data generated in the past decade is much larger than all the data collected in the past century combined \cite{data2013}. Traditional techniques for data analytics are not capable of analyzing these huge amounts of data. To make matters more challenging, the data is usually also high-dimensional, which increases the complexity of the problem further. For now, no single algorithm tackles all the problems of handling big data, although many works address specific aspects of it \cite{xu2016exploring}.
Some existing ways to visualize high-dimensional data are dimensionality reduction techniques such as Random Projections \cite{bingham2001random,kaski1998dimensionality}, Self-Organizing Maps (SOM) \cite{kohonen1990self}, Multidimensional Scaling (MDS) \cite{kruskal1964multidimensional}, and Principal Component Analysis (PCA) \cite{dunteman1989principal}, which reduce the number of dimensions by mapping high-dimensional data into a lower-dimensional space. This mapping inevitably loses information, but these algorithms are designed so that useful distances are preserved and the information loss is minimized. The problem is that their running time grows rapidly with the size of the data, which makes them unsuitable for big data. Parallelizable implementations of SOM \cite{carpenter1987massively}, MDS \cite{varoneckas2015parallel}, and PCA \cite{andrecut2009parallel} exist, but parallelization only speeds them up by a factor linear in the number of processors, which may be good enough for now but will not scale well into the future.
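To make the mapping these techniques perform concrete, the sketch below reduces hypothetical high-dimensional points to two dimensions with a Gaussian random projection, the simplest of the methods listed; the data sizes and function name are illustrative assumptions, not part of any cited implementation.

```python
import numpy as np

def random_projection(X, target_dim=2, seed=0):
    """Map n points from d dimensions down to target_dim dimensions
    using a Gaussian random matrix (Johnson-Lindenstrauss style).
    Pairwise distances are approximately preserved in expectation."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Scale entries by 1/sqrt(target_dim) so squared norms are
    # preserved in expectation after projection.
    R = rng.normal(0.0, 1.0 / np.sqrt(target_dim), size=(d, target_dim))
    return X @ R

# Hypothetical data set: 1000 points in 50 dimensions, reduced to 2
# so they could be drawn on a scatter plot.
X = np.random.default_rng(1).normal(size=(1000, 50))
Y = random_projection(X, target_dim=2)
print(Y.shape)  # (1000, 2)
```

The projection costs only one matrix multiply, which is why random projections are often the cheapest of the listed methods, at the price of preserving distances only approximately.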
Clustering is another technique used in data mining. For big data, a clustering algorithm needs to run in at most quasilinear time. Many clustering algorithms meet this requirement, such as BIRCH \cite{zhang1996birch}, FCM \cite{bezdek1984fcm}, DBSCAN \cite{ester1996density}, EM \cite{dempster1977maximum}, and OPTICS \cite{ankerst1999optics}, to name a few. BFR (Bradley-Fayyad-Reina) \cite{bradley1998scaling} and CLIQUE \cite{agrawal1998automatic} seem particularly promising for the task of big data visualization. The BFR algorithm is a variant of K-Means that can handle large data. The idea is that if we assume the clusters to be normally distributed, then we can summarize each cluster by its mean and standard deviation, effectively reducing the number of data points to be processed in the succeeding iterations. This notion of summarizing data points to reduce their number may be applied to visualization to increase speed with minimal loss of information. CLIQUE, on the other hand, is a subspace clustering algorithm: it looks for clusters in subsets of the dimensions. This may be useful both for reducing the number of dimensions and for revealing patterns that are hidden when certain dimensions are included.
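The cluster-summarization idea borrowed from BFR can be sketched as follows: a cluster is kept only as its sufficient statistics (count, per-dimension sum, and per-dimension sum of squares), from which the mean and standard deviation are recoverable, so the individual points can be discarded. This is a minimal sketch of the summarization step alone, with illustrative names and data, not the full BFR algorithm.

```python
import numpy as np

class ClusterSummary:
    """BFR-style summary of one cluster: N, SUM, and SUMSQ per
    dimension. Mean and standard deviation are recoverable from
    these, so the raw points need not be retained."""
    def __init__(self, dim):
        self.n = 0
        self.sum = np.zeros(dim)
        self.sumsq = np.zeros(dim)

    def add(self, x):
        # Fold one point into the running statistics, then discard it.
        self.n += 1
        self.sum += x
        self.sumsq += x * x

    @property
    def mean(self):
        return self.sum / self.n

    @property
    def std(self):
        # Var = E[x^2] - E[x]^2, computed per dimension.
        return np.sqrt(self.sumsq / self.n - self.mean ** 2)

# Hypothetical cluster: 10,000 points in 3 dimensions drawn around
# mean 5.0 with standard deviation 2.0.
rng = np.random.default_rng(0)
points = rng.normal(loc=5.0, scale=2.0, size=(10_000, 3))
summary = ClusterSummary(dim=3)
for p in points:
    summary.add(p)
print(summary.mean, summary.std)
```

Whatever the number of points folded in, the summary stays at a fixed size per cluster, which is exactly why succeeding iterations become cheaper once points are absorbed into summaries.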