Evaluation of Dimensionality Reduction Techniques for SOM Clustering of Textual Data

N. Ampazis and S.J. Perantonis (Greece)



The dimensionality of the data vectors that represent textual document collections, encoded according to the Vector Space Model (VSM), is usually very high, and this results in burdensome computations for the majority of clustering algorithms. It is therefore beneficial to reduce the dimensionality of the data vectors before applying any clustering algorithm that is based on the computation of distances in the original data space. Two of the most frequently used dimensionality reduction methods are Principal Component Analysis (PCA), or Latent Semantic Indexing (LSI), and Random Projection (RP). A third dimensionality reduction method can be constructed as a two-step approach in which a random projection to a lower dimension is first applied to the initial corpus, followed by LSI (RP/LSI). However, empirical results for all these methods are sparse, especially regarding the effects that these data representations have on the ability of the Self-Organizing Map (SOM) algorithm to semantically cluster textual data. In this paper we use a well-defined measure for comparing the similarity of different maps trained with the three different representations of the original textual data set, and we illustrate that RP and RP/LSI produce maps whose quality is equivalent to that of maps trained with the computationally expensive LSI representations.
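The two-step RP/LSI reduction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the term-document matrix, dimensions, and random seed are hypothetical, and the SVD step stands in for LSI on the already-projected data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical VSM corpus: 50 documents over a 1000-term vocabulary.
X = rng.random((50, 1000))

# Step 1 -- Random Projection (RP): multiply by a random Gaussian
# matrix to drop from 1000 dimensions to k = 100.
k = 100
R = rng.normal(size=(1000, k)) / np.sqrt(k)
X_rp = X @ R                      # shape (50, 100)

# Step 2 -- LSI on the projected corpus: truncated SVD keeping the
# d = 10 leading singular directions (the "latent" dimensions).
d = 10
U, s, Vt = np.linalg.svd(X_rp, full_matrices=False)
X_rp_lsi = U[:, :d] * s[:d]       # shape (50, 10)

print(X_rp.shape, X_rp_lsi.shape)
```

The RP step is cheap (a single matrix multiplication), so the expensive SVD in the LSI step runs on a 100-dimensional matrix rather than the original 1000-dimensional one, which is the computational motivation for the combined approach.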
