Computational Methods for Identifying Number of Clusters in Gene Expression Data

H. Ressom, D. Wang, and P. Natarajan (USA)


Gene expression data, microarray, clustering, cluster validation.


With the advent of microarray technology, there is a growing need to reliably extract biologically significant information from massive gene expression data. Clustering, which identifies groups of genes that manifest similar expression patterns, is one of the key steps in analyzing gene expression data. Many algorithms for clustering gene expression data have been reported in the literature. However, there has been limited progress on cluster validation and on identifying the number of clusters present in gene expression data. In this paper, we investigate the relative merits of four algorithms in clustering two gene expression data sets. The clustering methods we investigated are the popular self-organizing maps (SOM), adaptive double self-organizing maps (ADSOM), fuzzy c-means (FCM), and a model-based clustering method. Their corresponding clusters are validated using the figure of merit (FOM), a hierarchical-tree-based index, the Xie-Beni index, which gives a measure of the compactness and separation of clusters, and an approximation called the Bayesian information criterion (BIC). Our intent is to provide a useful guide for choosing the appropriate computational method for identifying the number of clusters in gene expression data analysis. We observed that ADSOM outperformed the three other clustering methods in detecting the number of clusters present in the two gene expression data sets.
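
As an illustration of what "compactness and separation" means for the Xie-Beni index, the following is a minimal sketch based on the standard textbook definition of the index, not on the authors' implementation. The function name xie_beni, its arguments, and the toy data are hypothetical and chosen only for this example. The index divides the membership-weighted sum of squared distances between points and cluster centers (compactness) by the number of points times the smallest squared distance between any two centers (separation), so lower values suggest a better partition.

```python
def xie_beni(points, centers, memberships, m=2.0):
    """Xie-Beni index (standard definition); lower is better.

    points:      list of n coordinate tuples
    centers:     list of c cluster-center tuples (c >= 2)
    memberships: c x n nested list, memberships[i][k] = degree to which
                 point k belongs to cluster i
    m:           fuzzifier exponent (assumed 2.0 here)
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    n = len(points)
    c = len(centers)

    # Compactness: membership-weighted squared distances to each center.
    compactness = sum(
        (memberships[i][k] ** m) * sq_dist(points[k], centers[i])
        for i in range(c)
        for k in range(n)
    )

    # Separation: smallest squared distance between two distinct centers.
    separation = min(
        sq_dist(centers[i], centers[j])
        for i in range(c)
        for j in range(i + 1, c)
    )

    return compactness / (n * separation)


# Toy data (hypothetical): two tight, well-separated groups of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

# A partition that matches the two groups.
good = xie_beni(points, [(0.0, 0.5), (10.0, 0.5)],
                [[1, 1, 0, 0], [0, 0, 1, 1]])

# A partition that splits each group across both clusters.
bad = xie_beni(points, [(5.0, 0.0), (5.0, 1.0)],
               [[1, 0, 1, 0], [0, 1, 0, 1]])
```

In this toy case the partition that matches the two groups yields a much smaller index than the one that cuts across them. One common way to use such an index for choosing the number of clusters is to repeat the clustering for several candidate cluster counts and compare the resulting values.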
