Text Categorization of Commercial Web Pages

E. Binaghi, M. Carullo, I. Gallo, and M. Madaio (Italy)


Text categorization, Kohonen Self-Organizing Map, neural network, multilayer perceptron


In this paper we describe a new on-line document categorization strategy that can be integrated within Web applications. A salient aspect is the use of neural learning in both the representation and classification tasks. Text documents are treated as images, and the regions of interest (RoI) containing information meaningful for categorization are identified with the support of a supervised neural network. Text within each RoI is represented by a simple scheme that takes the first K words of the text and encodes them appropriately. A Kohonen Self-Organizing Map (SOM) is applied to cluster the documents, which are subsequently labelled by a simple majority voting mechanism. The adopted solutions were evaluated through experiments in the context of on-line price comparison services. The results demonstrate that the overall classification strategy categorizes documents satisfactorily, given the high variability of Web pages.
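
The clustering and labelling stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the first K words are coded, so the CRC32-based encoding in `encode_first_k`, the one-dimensional map, the Gaussian neighbourhood, and every parameter value (`k`, `n_units`, `epochs`, learning rate) are assumptions chosen for illustration, and the RoI detection stage is omitted entirely.

```python
import math
import random
import zlib
from collections import Counter, defaultdict


def encode_first_k(text, k=4):
    """Encode the first k words as numbers in [0, 1); pad short texts with 0.0.

    The CRC32-based code is a placeholder assumption, not the paper's coding.
    """
    words = text.lower().split()[:k]
    codes = [(zlib.crc32(w.encode("utf-8")) % 1000) / 1000.0 for w in words]
    return codes + [0.0] * (k - len(codes))


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_unit(weights, vec):
    """Index of the map unit whose weight vector is closest to vec."""
    return min(range(len(weights)), key=lambda i: sq_dist(weights[i], vec))


def train_som(vectors, n_units=4, epochs=50, seed=0):
    """Train a 1-D SOM (a chain of units) with decaying rate and neighbourhood."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs           # decays from 1 towards 0
        lr = 0.5 * frac                       # learning rate
        sigma = max(0.5, (n_units / 2.0) * frac)  # neighbourhood width
        for vec in vectors:
            bmu = best_unit(weights, vec)
            for i, w in enumerate(weights):
                d = abs(i - bmu)              # distance along the chain
                influence = lr * math.exp(-(d * d) / (2.0 * sigma * sigma))
                for j in range(dim):
                    w[j] += influence * (vec[j] - w[j])
    return weights


def label_units(weights, vectors, labels):
    """Give each unit the majority label of the training vectors mapped to it."""
    votes = defaultdict(Counter)
    for vec, lab in zip(vectors, labels):
        votes[best_unit(weights, vec)][lab] += 1
    return {unit: c.most_common(1)[0][0] for unit, c in votes.items()}


def classify(weights, unit_labels, text, k=4):
    """Label a new text by its best-matching unit; None if that unit got no votes."""
    return unit_labels.get(best_unit(weights, encode_first_k(text, k)))


if __name__ == "__main__":
    # Hypothetical product snippets and categories, invented for this sketch.
    docs = ["digital camera 12 megapixel", "laptop computer 15 inch",
            "digital camera zoom lens", "laptop computer 8gb ram"]
    cats = ["camera", "laptop", "camera", "laptop"]
    vecs = [encode_first_k(d) for d in docs]
    som = train_som(vecs)
    unit_labels = label_units(som, vecs, cats)
    print(classify(som, unit_labels, "digital camera with tripod"))
```

Returning `None` for a unit that received no training votes keeps unlabelled regions of the map visible to the caller instead of silently guessing a category; a real system would need its own policy for that case.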
