Feature-based Cluster Validation for High-Dimensional Data

R. Kassab and J.-C. Lamirel (France)


Unsupervised learning, model selection, feature-based correlation analysis, cluster validity, high-dimensional data


Cluster validation is commonly used to determine the optimal number of clusters in a data set. Despite the success of distance-based validity indexes, their efficacy decreases rapidly when dealing with high-dimensional data. The present paper introduces a feature-based cluster validation criterion that can cope with such data. In contrast to distance-based methods, our criterion evaluates similarity in terms of the relevant features shared between data items. The idea is based on identifying the "core" features, which are correlated within the description of each discovered cluster. The individual quality of each cluster is then evaluated through the frequency of its core features relative to that of its non-core features, while between-cluster isolation is measured by the overlap coefficient between clusters, considering only their core features. The overall clustering quality is measured by a weighted combination of the within-cluster and between-cluster correlation coefficients, which enables choosing an appropriate number of clusters according to the purpose of clustering. Furthermore, our validation can prune out unreliable clusters, which have no correlated features and thus no specific description of their content. Extensive experiments on the Reuters-21578 collection are conducted to show the effectiveness of our validation criterion.
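
The steps named in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' formulation: the abstract gives no formulas, so the core-feature rule (a feature is "core" when it occurs in at least a `min_share` fraction of a cluster's documents), the within-cluster score (share of feature occurrences that fall on core features), and the weighting parameter `alpha` are all hypothetical stand-ins. Only the overlap coefficient, |A ∩ B| / min(|A|, |B|), follows its standard definition.

```python
from itertools import combinations


def core_features(cluster_docs, min_share=0.6):
    """Assumed rule: a feature is 'core' if it occurs in at least
    `min_share` of the cluster's documents (not the paper's definition)."""
    counts = {}
    for doc in cluster_docs:
        for feature in set(doc):
            counts[feature] = counts.get(feature, 0) + 1
    n_docs = len(cluster_docs)
    return {f for f, c in counts.items() if c / n_docs >= min_share}


def within_quality(cluster_docs, core):
    """Assumed score: fraction of feature occurrences in the cluster
    that belong to core features."""
    total = sum(len(set(doc)) for doc in cluster_docs)
    hits = sum(len(set(doc) & core) for doc in cluster_docs)
    return hits / total if total else 0.0


def overlap_coefficient(a, b):
    """Standard overlap coefficient between two non-empty sets."""
    return len(a & b) / min(len(a), len(b))


def validity(clusters, alpha=0.5, min_share=0.6):
    """Hypothetical weighted combination of within-cluster quality
    and between-cluster isolation (1 - mean core-feature overlap)."""
    cores = [core_features(docs, min_share) for docs in clusters]
    # Prune clusters that have no core features at all.
    kept = [(docs, core) for docs, core in zip(clusters, cores) if core]
    if not kept:
        return 0.0
    within = sum(within_quality(d, c) for d, c in kept) / len(kept)
    pairs = list(combinations([c for _, c in kept], 2))
    overlap = (
        sum(overlap_coefficient(a, b) for a, b in pairs) / len(pairs)
        if pairs
        else 0.0
    )
    return alpha * within + (1 - alpha) * (1 - overlap)


# Toy data: each document is a set of terms (illustrative only).
clusters = [
    [{"oil", "price"}, {"oil", "barrel"}],
    [{"bank", "rate"}, {"bank", "loan"}],
]
score = validity(clusters)
```

Under these assumptions, one would compute `validity` for each candidate number of clusters and keep the partition with the highest score; raising `alpha` favors internally coherent clusters, while lowering it favors well-separated ones.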
