Filtering XML Target Documents for EDMS

J.-W. Lee and K. Lee (Korea)


XML, information filtering, EDMS, document similarity, document management


XML allows users to define elements using arbitrary words and to organize them in a nested structure. It offers flexibility in data presentation and supports integration with RDBMS. Because of these features, many Electronic Document Management Systems (EDMS) have adopted XML as a representative document format. An EDMS obtains XML documents from the Web, other companies, or users, and needs to integrate the newly obtained documents with its existing XML documents. It also requires techniques to classify and retrieve the integrated documents effectively. A technique is therefore needed for filtering XML target documents, that is, documents that have a specific purpose or style that the user wants, or that match sample documents. In this paper, we propose a methodology for filtering target documents from a mixed collection of XML documents. We extract XML features (elements and nested structure) and discover similar features between XML documents. We then compute the similarity between XML documents using the features determined to be similar, and filter the documents whose similarity exceeds a threshold. In experiments with Yahoo! pages, we obtained almost 100% accuracy in filtering XML target documents for each category.
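
The pipeline described above (feature extraction, similarity computation, threshold filtering) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the similarity measure or how similar features are discovered, so the sketch assumes exact matching of element names and of root-to-element paths, combined with a Jaccard coefficient. The function names, the `structure_weight` parameter, and the default threshold are all hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_features(xml_text):
    """Return (element names, root-to-element tag paths) for one XML document."""
    root = ET.fromstring(xml_text)
    names, paths = set(), set()

    def walk(node, prefix):
        # The path records the nesting of an element, e.g. "/book/title".
        path = prefix + "/" + node.tag
        names.add(node.tag)
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return names, paths


def jaccard(a, b):
    """Jaccard coefficient of two sets (assumed measure, not from the paper)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(doc_a, doc_b, structure_weight=0.5):
    """Weighted mix of element-name overlap and nested-path overlap."""
    names_a, paths_a = extract_features(doc_a)
    names_b, paths_b = extract_features(doc_b)
    return ((1 - structure_weight) * jaccard(names_a, names_b)
            + structure_weight * jaccard(paths_a, paths_b))


def filter_documents(sample, candidates, threshold=0.6):
    """Keep candidates whose similarity to the sample exceeds the threshold."""
    return [doc for doc in candidates if similarity(sample, doc) > threshold]


if __name__ == "__main__":
    sample = "<book><title/><author/></book>"
    candidates = [
        "<book><title/><author/><year/></book>",
        "<recipe><ingredient/><step/></recipe>",
    ]
    for doc in filter_documents(sample, candidates):
        print(doc)
```

Exact tag matching is the simplest possible choice here. A closer approximation of the proposed method would replace it with a step that treats differently named but semantically similar elements as matching.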

