A Novel Effective Distance Measure and a Relevant Algorithm for
Optimizing the Initial Cluster Centroids of K-means
- Liu Yang,
- Ma Shuaifeng,
- Du Xinxin
Abstract
The traditional K-means algorithm is highly sensitive to the choice of
initial cluster centers and to the distance measure, so it easily
converges to a locally optimal solution. It also converges slowly, has
low clustering accuracy, and suffers from memory bottlenecks when
processing massive data. This paper therefore proposes an improved
K-means algorithm. First, the selection of the initial points in the
traditional clustering algorithm is improved. Second, a new global
measure, the effective distance measure, is proposed; its main idea is
to calculate the effective distance between two data samples by sparse
reconstruction. Finally, on the basis of the MapReduce framework, the
efficiency of the algorithm is further improved by adjusting the Hadoop
cluster. Using real customer data from the JD Mall dataset, the paper
evaluates the clustering quality of several algorithms with the
Davies-Bouldin index (DBI), the Rand index, and other indicators. The
results show that the proposed algorithm converges well, is accurate,
and outperforms the other algorithms compared.
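
For readers unfamiliar with the two evaluation indicators named above,
the following is a minimal sketch of their textbook definitions in plain
Python. It is not the paper's evaluation code: the function names
rand_index and davies_bouldin_index, the use of Euclidean distance, and
the toy data are assumptions made here for illustration only.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which the two labelings agree.

    A pair agrees when both labelings put it in the same cluster,
    or both put it in different clusters. Higher is better (max 1.0).
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += same_true == same_pred
        total += 1
    return agree / total


def davies_bouldin_index(points, labels):
    """Davies-Bouldin index with Euclidean distance. Lower is better.

    Assumes at least two clusters whose centroids are distinct.
    """
    # Group the points by cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    # Centroid of each cluster (coordinate-wise mean).
    centroids = {
        label: tuple(sum(coord) / len(members) for coord in zip(*members))
        for label, members in clusters.items()
    }
    # Scatter: mean distance from a cluster's points to its centroid.
    scatter = {
        label: sum(dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }

    # For each cluster, take its worst (largest) similarity ratio
    # against any other cluster, then average over clusters.
    worst_ratios = []
    for a in clusters:
        worst_ratios.append(
            max(
                (scatter[a] + scatter[b]) / dist(centroids[a], centroids[b])
                for b in clusters
                if b != a
            )
        )
    return sum(worst_ratios) / len(worst_ratios)


# Toy data (hypothetical): two well-separated groups in the plane.
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
toy_labels = [0, 0, 1, 1]
toy_dbi = davies_bouldin_index(toy_points, toy_labels)
toy_rand = rand_index(toy_labels, [1, 1, 0, 0])
```

The Rand index compares a clustering against reference labels and does
not depend on how the clusters are numbered, whereas the DBI needs only
the data and the cluster assignment, so the two indicators complement
each other.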