Similar Documents Retrieval to Help Browsing and Editing in Digital Repositories

F. de A. Barros, E.F.A. Silva, J.C.B. Rabelo, and F.B. Fernandes (Brazil)


Web Information Retrieval, Approximate Retrieval, Search for Similar Documents.


The rapid growth of electronic text collections (the Web in particular) and the diversity of available documents have made it much harder to retrieve relevant documents efficiently. A variety of Web search engines have been built to help users with this task, but the documents they retrieve often lack precision. Several solutions for improving retrieval precision have been proposed, but none has proved satisfactory so far. To investigate a new approach to this problem, we developed ActiveSearch, a standalone application that suggests documents similar to the one the user is currently browsing or editing. It processes different document formats (e.g., HTML, DOC) in different repositories, focusing on the Internet, local area networks, and local directories. Once activated, the system queries the repository the user is accessing and reorders the list of retrieved documents according to their similarity to the content and format of the current document, also taking into account the user's preferences (recorded in the user's profile). The documents in the reordered list may further be grouped into dynamic clusters to make the results easier to visualize. The system performed well in our tests, with a precision rate of 57%. Compared with the Google Toolbar, its precision rate was 34 percentage points higher.
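The reordering step can be illustrated with a minimal sketch. The abstract does not say which similarity measure ActiveSearch uses, so the cosine similarity over raw term counts shown here is an assumption chosen for illustration, and the names `term_counts`, `cosine_similarity`, and `rerank` are hypothetical. Format similarity and user-profile preferences are left out.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    # Counter returns 0 for terms missing from b, so only shared terms contribute.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(current_text, candidates):
    """Return (doc_id, score) pairs sorted by decreasing similarity.

    `candidates` is a list of (doc_id, text) pairs, standing in for the
    list returned by whatever repository search was consulted (assumed).
    """
    query = term_counts(current_text)
    scored = [
        (doc_id, cosine_similarity(query, term_counts(text)))
        for doc_id, text in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling `rerank` with the text of the document being edited and a list of retrieved documents yields the same documents ordered from most to least similar. A fuller version would presumably weight terms (for example with TF-IDF) and combine the content score with the format and profile signals mentioned above.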

