Simple Measures of Individual Cluster-Membership Certainty for Hard Partitional Clustering

We propose two posterior-probability-like measures of individual cluster-membership certainty that can be applied to a hard partition of the sample, such as that obtained from the Partitioning Around Medoids (PAM) algorithm. One measure extends the individual silhouette widths, and the other is obtained directly from the pairwise dissimilarities in the sample. Unlike the classic silhouette, the measures behave like probabilities and can be used to investigate an individual’s tendency to belong to each cluster. Motivated by an application to a clinical database, we evaluate the performance of both measures for individuals with ambiguous cluster membership, using simulated binary datasets that have been partitioned by the PAM algorithm. For comparison, we also present results from soft clustering algorithms, namely fuzzy analysis clustering (FANNY) and two model-based clustering methods. Our proposed measures perform comparably to the posterior-probability estimators from both FANNY and the model-based clustering methods.

Keywords: Cluster-membership certainty; FANNY algorithm; Model-based clustering; Partitioning around medoids algorithm; Silhouette width; Soft clustering

Introduction

Clinical disease registries frequently contain information recorded as categorical variables for each patient. To explore such data, we may wish to cluster the patients into groups of similar individuals. For example, we may wish to use patient symptoms at diagnosis to identify groups that respond differently to treatment. One approach to clustering categorical data is Bayesian profile regression (Molitor et al., 2010), which can incorporate information on an outcome variable. The profile regression model is fitted to the data with a Markov chain Monte Carlo algorithm, in which the number of clusters and the cluster memberships change at each sweep (Liverani et al., 2015) and the co-occurrence of each pair of individuals in the same cluster is tracked. After all sweeps are complete, a similarity matrix is created by averaging the pairwise co-occurrences across the sweeps. Individuals are then assigned to clusters by applying the Partitioning Around Medoids (PAM) algorithm (Kaufman and Rousseeuw, 1990) to the corresponding dissimilarity matrix.
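As a rough illustration of the averaging step described above, the following Python sketch builds a co-occurrence similarity matrix from a set of sweep-wise label vectors. The function names, the toy labels, and the conversion of similarity to dissimilarity as one minus similarity are assumptions made for this example; they are not taken from the profile regression software.

```python
def cooccurrence_similarity(sweeps):
    """Average pairwise co-clustering indicators over MCMC sweeps.

    sweeps[t][i] is the cluster label of individual i at sweep t.
    Labels only need to be comparable within a sweep, not across sweeps.
    """
    n_sweeps = len(sweeps)
    n = len(sweeps[0])
    counts = [[0] * n for _ in range(n)]
    for labels in sweeps:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    # Proportion of sweeps in which i and j share a cluster.
    return [[c / n_sweeps for c in row] for row in counts]


def to_dissimilarity(sim):
    # Assumption for this sketch: dissimilarity = 1 - similarity.
    return [[1.0 - s for s in row] for row in sim]


# Hypothetical toy input: 4 individuals, 3 sweeps.
toy_sweeps = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [2, 2, 5, 5],
]
sim = cooccurrence_similarity(toy_sweeps)
dis = to_dissimilarity(sim)
```

The resulting matrix `dis` is the kind of input that would then be passed to a PAM implementation to obtain the hard partition.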

One limitation of this approach is that so-called “hard” partitional clustering algorithms such as PAM assign individuals to distinct clusters but do not provide a measure of cluster-membership certainty for each individual. Yet, in many applied settings, such a measure is needed to identify individuals with ambiguous group membership. One measure of how well an individual fits its assigned cluster is the silhouette (Rousseeuw, 1987). Silhouette values range from -1 to 1, with high values indicating that the individual is well matched to its assigned cluster relative to neighbouring clusters. In this note, we propose a simple extension of the silhouette from a single value pertaining to the individual’s assigned cluster to a vector of values pertaining to all the clusters in the partition. An attractive feature of the extension is that an individual’s values sum to one across the clusters and thus admit a posterior-probability-like interpretation. Such an interpretation is helpful for assessing the individual’s membership uncertainty after the hard clustering has been performed. We also propose a second posterior-probability-like measure of cluster membership that is based directly on the dissimilarity matrix and the partition. The performance of the proposed measures is evaluated in a limited simulation study. Both measures behave similarly to posterior probabilities from model-based and fuzzy clustering. For researchers exploring their data with a hard partitional clustering algorithm, the proposed measures therefore offer a straightforward way to augment existing output with posterior-probability-like measures of cluster-membership uncertainty. Although our motivation is an application of Bayesian profile regression, the measures can be applied to any pairwise dissimilarity matrix and cluster-membership assignment obtained from hard clustering.
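For readers unfamiliar with the classic silhouette, a minimal Python sketch of its computation from a dissimilarity matrix and a hard partition is given here. It implements only the standard silhouette width of Rousseeuw (1987), s(i) = (b_i - a_i) / max(a_i, b_i), and not the extension proposed in this note. The function name and the toy dissimilarities are hypothetical, and the sketch assumes the partition contains at least two clusters.

```python
def silhouette_widths(dis, labels):
    """Classic silhouette width for each individual.

    dis[i][j] is the dissimilarity between individuals i and j;
    labels[i] is the hard cluster assignment of individual i.
    """
    n = len(labels)
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette requires at least two clusters")
    widths = []
    for i in range(n):
        # Mean dissimilarity from i to each cluster, excluding i itself.
        mean_to = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                mean_to[c] = sum(dis[i][j] for j in members) / len(members)
        own = labels[i]
        if own not in mean_to:
            # Singleton cluster: the silhouette is conventionally set to 0.
            widths.append(0.0)
            continue
        a = mean_to[own]                                       # own cluster
        b = min(v for c, v in mean_to.items() if c != own)     # nearest other
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Hypothetical toy dissimilarities for 4 individuals in 2 clusters.
toy_dis = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
toy_labels = [0, 0, 1, 1]
widths = silhouette_widths(toy_dis, toy_labels)
```

Each individual receives a single value tied to its assigned cluster, which is why the silhouette by itself does not describe how certainty is distributed across all clusters in the partition.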