Does IDF Improve Declustering Techniques for Keyword-based Information Retrieval?

S. Behl and R.M. Verma (USA)


Declustering, streaming model, text data, load balancing.


Multiple-disk architectures are an attractive way to meet the high-performance I/O demands of I/O-intensive applications such as search engines, web servers, and information retrieval systems. Exploiting them requires addressing dynamic load balancing and access parallelism, which is the goal of this paper. We address the problem of document declustering in a keyword-based information retrieval system for parallel architectures consisting of a single processor and multiple disks. In the vector-space information retrieval model, the inverse-document-frequency (IDF) factor was found to improve query performance. We study whether this is also the case for declustering. We propose and experimentally evaluate four similarity-based methods for declustering documents: vector, Euclidean, vector without IDF, and Euclidean without IDF. Interestingly, our results show that the vector method significantly outperforms the vector method without IDF, whereas the Euclidean method is only slightly superior to the Euclidean method without IDF in one scenario and too close to call in another. The vector method is the best for the so-called simple plan. We also introduce a highest-frequency-first retrieval scenario and compare the methods under it; there, the methods are too close to call.
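To make the four variants concrete, the following is a minimal sketch of similarity-based declustering, not the authors' algorithm, which the abstract does not spell out. It assumes a greedy placement rule (each document goes to the disk whose resident documents are least similar to it, with ties broken by load), cosine similarity for the "vector" measure, and a 1/(1+d) conversion of Euclidean distance into a similarity. All function names and parameters are hypothetical.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Return log(N / df) for every term in the collection."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def weight_vector(doc, idf=None):
    """Term-frequency vector, scaled by IDF when idf is given."""
    tf = Counter(doc)
    if idf is None:
        return {t: float(f) for t, f in tf.items()}
    return {t: f * idf[t] for t, f in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is zero)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def euclidean(u, v):
    """Euclidean distance between two sparse vectors."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))


def decluster(docs, num_disks, use_idf=True, measure="vector"):
    """Assumed greedy rule: spread similar documents across different disks."""
    idf = idf_weights(docs) if use_idf else None
    vecs = [weight_vector(d, idf) for d in docs]
    disks = [[] for _ in range(num_disks)]

    def cost(v, disk):
        if measure == "vector":
            sim = sum(cosine(v, vecs[j]) for j in disk)
        else:
            # Assumption: turn distance into similarity via 1 / (1 + d).
            sim = sum(1.0 / (1.0 + euclidean(v, vecs[j])) for j in disk)
        return (sim, len(disk))  # least similar first, then least loaded

    for i, v in enumerate(vecs):
        best = min(range(num_disks), key=lambda d: cost(v, disks[d]))
        disks[best].append(i)
    return disks


docs = [["disk", "io"], ["disk", "io"], ["query", "term"], ["query", "term"]]
print(decluster(docs, 2))
print(decluster(docs, 2, use_idf=False, measure="euclidean"))
```

In this toy collection, documents sharing keywords are likely to be requested by the same query, so separating them lets the disks serve that query in parallel. Toggling `use_idf` and `measure` yields the four combinations named in the abstract.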
