Japanese Text Classification using N-Gram and the Maximum Ratio of Term Frequency among Categories

M. Suzuki (Japan)

Keywords

Text mining, automatic text categorization, Naive Bayes, N-gram

Abstract

In this paper, we treat automatic text classification as a sequence of information-processing steps and propose a new classification technique called the Maximum Frequency Ratio Accumulation Method (MFRAM). MFRAM is a simple technique that adds up the maximum ratios of term frequency among categories. Because MFRAM places no limit on the number of feature terms that can be used, we propose using character N-grams and word N-grams as feature terms to exploit this property. We evaluate the proposed technique experimentally by classifying articles from the Japanese newspaper “CD-Mainichi 2002” with both the Naive Bayes method (the baseline) and the proposed method. The proposed method achieved a classification accuracy of 88.7% and substantially outperformed the baseline. Although the proposed method is simple, it offers a new viewpoint and high potential, and further development can be expected.
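The abstract does not give the exact formulas, so the following Python sketch is only one plausible reading of "adding up the maximum ratios of term frequency among categories", not the author's implementation. It assumes that each term's ratio for a category is its count in that category divided by its count across all categories, that each term votes only for the category where this ratio is largest, and that a document is assigned to the category with the largest accumulated sum. The function names (char_ngrams, train, classify), the use of character bigrams, and the lack of any normalization for category size are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of text, ignoring whitespace."""
    s = "".join(text.split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def train(docs, n=2):
    """Build {term: (best_category, max_ratio)} from (text, category) pairs.

    Assumption: ratio = count of term in category / count of term overall.
    """
    freq = defaultdict(Counter)  # term -> category -> count
    for text, category in docs:
        for gram in char_ngrams(text, n):
            freq[gram][category] += 1

    model = {}
    for term, by_category in freq.items():
        total = sum(by_category.values())
        best_category, best_count = max(by_category.items(), key=lambda kv: kv[1])
        model[term] = (best_category, best_count / total)
    return model


def classify(model, text, n=2):
    """Accumulate each known term's maximum ratio into its best category."""
    scores = Counter()
    for gram in char_ngrams(text, n):
        if gram in model:
            category, ratio = model[gram]
            scores[category] += ratio
    # Return None when no n-gram of the text was seen during training.
    return scores.most_common(1)[0][0] if scores else None
```

In this reading every observed n-gram is kept as a feature term with no feature selection step, which is how the sketch reflects the property that feature terms can be used without limit. Calling train on a list of labelled articles and then classify on a new article returns the predicted category label; word N-grams could be substituted by replacing char_ngrams with a tokenizer-based function.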
