An Enhanced Feature Selection Method for Text Classification

Jinbeom Kang, E. Lee, K. Hong, Jeahyun Park, T. Kim, Juyoung Park, J. Choi, and J. Yang (Korea)

Keywords

feature selection, impurity of words, unbalanced distribution, machine learning, text classification

Abstract

Feature selection in machine learning is the task of identifying a set of representative terms, or features, in a document collection; it is mainly used in text classification. Existing feature selection methods, including information gain and the χ² test, focus on features that are useful for all topics, and consequently lack the power to select features that are truly representative of a particular topic (or class). These methods also assume that the distribution of documents across classes is balanced. This assumption hurts classification accuracy, because real-world document collections rarely have a balanced distribution and it is difficult to prepare a training set with an equal number of documents for each class. To resolve this problem, we propose a new feature selection method for text classification that focuses on the purity of a word, which emphasizes its representativeness for a particular class. Our method also assumes an unbalanced distribution of documents over multiple classes, and combines feature values with weight factors that reflect the number of training documents in each class. In summary, we obtain feature candidates using word purity and then select features while accounting for the unbalanced distribution of documents. Experiments demonstrate that our method outperforms existing methods.
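
The abstract does not give the purity formula or the weight factors, so the following Python sketch is only an illustration of the general idea under stated assumptions, not the authors' method. It assumes (hypothetically) that a word's purity is the largest share any single class holds of that word's document frequency, and that the class-size weight is simply a division of each count by the number of training documents in that class. The names purity_scores and select_features are invented for this example.

```python
from collections import Counter, defaultdict


def purity_scores(docs):
    """Score each word by an assumed class-size-weighted purity.

    docs: list of (class_label, list_of_tokens) pairs.
    Returns {word: (most_representative_class, purity in (0, 1])}.
    NOTE: this formula is an assumption made for illustration; the
    abstract does not specify the actual definition.
    """
    # Number of training documents per class (the "weight factor" input).
    class_sizes = Counter(label for label, _ in docs)

    # df[word][label] = number of documents of that class containing the word.
    df = defaultdict(Counter)
    for label, tokens in docs:
        for word in set(tokens):
            df[word][label] += 1

    scores = {}
    for word, per_class in df.items():
        # Divide by class size so that large classes do not dominate.
        rates = {c: per_class[c] / class_sizes[c] for c in per_class}
        total = sum(rates.values())
        best = max(rates, key=rates.get)
        # Purity is 1.0 when the word occurs in exactly one class.
        scores[word] = (best, rates[best] / total)
    return scores


def select_features(docs, k):
    """Return the k purest words as (word, class, purity) triples."""
    scores = purity_scores(docs)
    # Sort by descending purity, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1][1], kv[0]))
    return [(word, cls, purity) for word, (cls, purity) in ranked[:k]]
```

With three "sports" documents and one "politics" document, a word that appears once in each class is attributed to the smaller class under this weighting, because its per-class rate there is higher. That is the kind of correction for unbalanced collections the abstract describes, though the real method may weight classes differently.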

From Proceeding (523) Computational Intelligence - 2006