Mark Tozer

1) Clustering is indispensable in the quest for robust vegetation classification schemes, which aim to partition continua in order to summarise and communicate pattern. However, clustering solutions are sensitive to methods and data and are therefore unstable, a feature that is usually attributed to noise. Viewed through a central-tendency lens, noise is defined as the degree of departure from type. This is problematic because vegetation types are abstractions of continua, so noise can only be quantified relative to the particular solution at hand. Graph theory models the structure of vegetation data based on the interconnectivity of samples. Through a graph-theoretic lens, the causes of instability can be quantified in absolute terms via the degree of connectivity among objects.

2) We simulated incremental increases in sampling intensity in a dataset over five iterations and assessed classification stability across successive solutions derived using algorithms implementing models of central tendency and interconnectivity, respectively. We used logistic regression to model the likelihood of a sample changing groups between iterations as a function of its distance to centroid and its degree of interconnectivity.

3) Our results show that the degree to which samples are interconnected is a more powerful predictor of instability than the degree to which they deviate from their nearest centroid. Removing weakly interconnected samples resulted in more stable classifications. However, solutions with many clusters appeared to be inherently less stable than those with few clusters, and the gains in stability from removing outliers declined as the number of clusters increased.

4) Our results reinforce the view that clusters abstracted from continuous data are inherently unstable, and that the quest for stable, fine-scale classifications from large regional datasets is illusory. Nevertheless, our results show that using models better suited to the analysis of continuous data may yield more stable classifications of the available data.
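
The two predictors named in point 2, distance to centroid and degree of interconnectivity, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the study: Euclidean distance stands in for whatever dissimilarity measure was used, interconnectivity is approximated by the number of other samples that count a sample among their k nearest neighbours, and the function names and toy data are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two samples (assumed measure, for illustration)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Coordinate-wise mean of a group of samples."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def distance_to_nearest_centroid(samples, labels):
    """Central-tendency predictor: distance from each sample to its nearest cluster centroid."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    centroids = [centroid(members) for members in groups.values()]
    return [min(euclidean(s, c) for c in centroids) for s in samples]

def knn_in_degree(samples, k):
    """Interconnectivity proxy (an assumption): how many other samples
    include each sample among their k nearest neighbours."""
    degree = [0] * len(samples)
    for i, s in enumerate(samples):
        others = sorted(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: euclidean(s, samples[j]),
        )
        for j in others[:k]:
            degree[j] += 1
    return degree

# Hypothetical toy data: two tight groups plus one isolated sample (last entry)
# that a clustering has assigned to the first group.
samples = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11),
           (5, 5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 0]

dist = distance_to_nearest_centroid(samples, labels)
degree = knn_in_degree(samples, k=2)
```

In this toy case the isolated sample is both the farthest from any centroid and the only sample that no other sample lists as a neighbour. In a stability analysis, the two vectors would serve as covariates in a logistic regression whose response is whether each sample changed groups between iterations.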

Mark Tozer

Abstract

Questions: Most clustering methods assume data are structured as discrete hyper-spheroidal clusters to be evaluated by measures of central tendency. If vegetation data do not conform to this model, they may be clustered incorrectly. What are the implications for cluster stability and evaluation if clusters are of irregular shape or density?

Location: Southeast Australia.

Methods: We define misplacement as the placement of a sample in a cluster other than that of its nearest neighbour, and hypothesise that optimising homogeneity incurs the cost of higher rates of misplacement. The Chameleon algorithm emphasises interconnectivity and is thus sensitive to the shape and distribution of clusters. We contrasted its solutions with those of traditional non-hierarchical and hierarchical (agglomerative and divisive) approaches.

Results: Chameleon-derived solutions had lower rates of misplacement and only marginally higher heterogeneity than those of k-means in the range 15–60 clusters, but the metrics of the two methods converged at larger numbers of clusters. Solutions derived by agglomerative clustering had the best metrics (and divisive clustering the worst), but both produced high-level clusters inferior to those of Chameleon because they merged distantly related clusters.

Conclusions: Our results suggest that Chameleon may have an advantage over traditional algorithms when data exhibit discontinuities and variable structure, potentially producing more stable solutions (due to lower rates of misplacement) while scoring lower on traditional metrics of central tendency. Chameleon's advantages are less obvious in the partitioning of data from continuous gradients; however, its graph-based partitioning protocol facilitates hierarchical integration of solutions.
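
The misplacement measure defined in the Methods can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes Euclidean distance (the abstract does not name the dissimilarity measure), and the function name and toy gradient are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two samples (assumed measure, for illustration)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def misplacement_rate(samples, labels):
    """Fraction of samples whose nearest neighbour belongs to a different cluster."""
    misplaced = 0
    for i, s in enumerate(samples):
        # Index of the closest other sample.
        nearest = min(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: euclidean(s, samples[j]),
        )
        if labels[nearest] != labels[i]:
            misplaced += 1
    return misplaced / len(samples)

# Hypothetical samples along a single gradient.
samples = [(0.0,), (1.0,), (2.0,), (2.9,), (4.0,), (5.0,)]

# A partition that splits the closest pair (2.0 and 2.9) into different clusters ...
split_pair = [0, 0, 0, 1, 1, 1]
# ... and one that keeps that pair together.
kept_pair = [0, 0, 0, 0, 1, 1]

rate_split = misplacement_rate(samples, split_pair)
rate_kept = misplacement_rate(samples, kept_pair)
```

Here the first partition misplaces two of the six samples, while the second misplaces none, even though the first yields two equally sized groups. This mirrors the hypothesised trade-off between homogeneity around a centroid and respect for nearest-neighbour links.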