Reducing the Dimensionality of Bag-of-words Text Representation Used by Learning Algorithms

C. Aparecida Martins, M.C. Monard, and E. Takashi Matsubara (Brazil)

Keywords

Text Mining, Machine Learning, Inductive Learning.

Abstract

The attribute-value representation of documents used in Text Mining provides a natural framework for classifying or clustering documents based on their content. Supervised learning algorithms can be applied whenever the documents have preassigned labels, and unsupervised learning algorithms when the documents are unlabeled. The attribute-value representation of documents is characterized by very high dimensional data, since every word in a document may be treated as an attribute. However, the representation of documents has a crucial influence on how well a supervised learning algorithm can generalize. This work presents a way to efficiently decompose text into words (stems) using the bag-of-words approach and to reduce the dimensionality of the resulting representation, making text accessible to most Machine Learning algorithms, which require each example to be described by a vector of fixed dimensionality. A computational tool we have implemented is applied to a real case to illustrate our proposal, as well as several facilities of the tool that make it possible to improve the accuracy of classifiers by reducing the dimensionality of the text representation.
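As an illustration of the general idea only, the following minimal sketch builds a stemmed bag-of-words representation and shrinks the vocabulary with document-frequency cuts. It is not the tool described in this abstract: the toy suffix-stripping stemmer, the function names, and the `min_df`/`max_df` thresholds are all assumptions chosen for the example, and document-frequency filtering is only one of several possible reduction strategies.

```python
import re
from collections import Counter


def naive_stem(word):
    # Hypothetical toy stemmer: strips a few English suffixes.
    # A real system would use a proper stemming algorithm instead.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    # Lowercase, keep alphabetic runs only, then stem each word.
    return [naive_stem(w) for w in re.findall(r"[a-z]+", text.lower())]


def build_vocabulary(docs, min_df=2, max_df=None):
    # Count in how many documents each stem occurs (document frequency).
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    if max_df is None:
        max_df = len(docs)
    # Assumed reduction rule: keep stems whose document frequency
    # lies within [min_df, max_df]; rare and overly common stems are dropped.
    return sorted(t for t, n in df.items() if min_df <= n <= max_df)


def vectorize(doc, vocabulary):
    # Fixed-length vector: one stem count per vocabulary entry.
    counts = Counter(tokenize(doc))
    return [counts[t] for t in vocabulary]


docs = ["Mining texts", "text mining tools", "learning algorithms"]
vocabulary = build_vocabulary(docs, min_df=2)
vectors = [vectorize(d, vocabulary) for d in docs]
```

Every document is mapped to a vector with one position per retained stem, so raising `min_df` or lowering `max_df` directly reduces the number of attributes a learning algorithm has to handle.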
