Text Classification based on the Bias of Word Frequency over Categories

M. Suzuki (Japan)


text categorization, automatic classification, vector space model, tfidf


In automatic text classification, for example classifying newspaper articles into predefined categories such as politics and sports, the crucial step is selecting appropriate keywords. Traditional classification methods based on the vector space model emphasize frequent words and therefore tend to disregard low-frequency words. However, low-frequency words are often effective for classification. For instance, technical terms appear only in specific categories, so their overall frequencies are generally low even though they are effective keywords. In this paper, we propose two text classification methods, the NDF method and the accumulation method, both based on the bias of the word frequency distribution over categories. Our experiments show that the accumulation method outperforms a traditional method based on the vector space model.
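
The idea that a rare word can still be strongly biased toward one category can be illustrated with a small sketch. This is not the NDF method or the accumulation method, which the abstract does not define; it only computes, under our own assumptions, the share of each word's occurrences that falls in each category. The function names `category_shares` and `bias` and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict


def category_shares(docs):
    """Map each word to the fraction of its occurrences found in each category.

    docs is an iterable of (category, tokens) pairs. This is an illustrative
    measure of frequency bias, not the paper's NDF or accumulation method.
    """
    counts = defaultdict(Counter)  # word -> Counter(category -> occurrences)
    for category, tokens in docs:
        for token in tokens:
            counts[token][category] += 1

    shares = {}
    for word, per_category in counts.items():
        total = sum(per_category.values())
        shares[word] = {c: n / total for c, n in per_category.items()}
    return shares


def bias(word_shares):
    """Largest single-category share: 1.0 means the word occurs in one category only."""
    return max(word_shares.values())


# Toy corpus (hypothetical): "the" is frequent but evenly spread,
# "quorum" is rare but confined to one category.
docs = [
    ("politics", ["the", "election", "the", "vote", "the", "quorum"]),
    ("sports", ["the", "match", "the", "goal", "the", "team"]),
]
shares = category_shares(docs)
print(bias(shares["the"]), bias(shares["quorum"]))
```

In this toy corpus the frequent word "the" is split evenly between the two categories, while the single occurrence of "quorum" falls entirely in politics, which is the kind of low-frequency but category-specific keyword the abstract describes.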
