A Parallel Web Page Identification System

J. Chung, R. Chau, and C.-H. Yeh (Australia)

Keywords

Web applications, parallel text alignment, knowledge discovery, Web mining system

Abstract

A parallel corpus is a rich linguistic resource for multilingual text management tasks, including cross-lingual text retrieval, multilingual computational linguistics, and multilingual text mining. Constructing a parallel corpus requires the effective alignment of parallel documents, that is, documents that are translated versions of the same text. In this paper, we develop a parallel page identification system that identifies and aligns parallel documents on the Web. The system uses a Web spider to crawl the Web and fetch potentially parallel multilingual Web documents. Two modules then determine whether a candidate document pair is parallel. First, a filename comparison module checks filename resemblance. Second, a content analysis module measures the semantic similarity of the two documents. A preliminary experiment conducted on a Hong Kong government Web site shows the effectiveness of the system.
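
The two-module check described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the language marker list, the thresholds, the bilingual lexicon, and all function names are hypothetical, and a simple Jaccard overlap of dictionary-translated terms stands in for the semantic similarity measure, which the abstract does not specify.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Hypothetical language markers; a real site would need its own list.
LANG_MARKERS = {"en", "eng", "e", "zh", "chi", "c", "tc", "sc"}


def normalize_filename(url: str) -> str:
    """Return the URL path with separators unified and language markers dropped."""
    path = urlparse(url).path.lower()
    tokens = re.split(r"[/_.\-]+", path)
    kept = [t for t in tokens if t and t not in LANG_MARKERS]
    return "/".join(kept)


def filename_similarity(url_a: str, url_b: str) -> float:
    """Filename resemblance in [0, 1] after removing language markers."""
    return SequenceMatcher(
        None, normalize_filename(url_a), normalize_filename(url_b)
    ).ratio()


def content_similarity(terms_a, terms_b, lexicon) -> float:
    """Stand-in for semantic similarity (an assumption, not the paper's measure):
    translate terms of document A with a bilingual lexicon, then take the
    Jaccard overlap with the terms of document B."""
    translated = {lexicon[t] for t in terms_a if t in lexicon}
    target = set(terms_b)
    union = translated | target
    return len(translated & target) / len(union) if union else 0.0


def is_parallel(url_a, url_b, terms_a, terms_b, lexicon,
                name_threshold=0.9, content_threshold=0.5) -> bool:
    """Accept a pair only if both modules agree (thresholds are illustrative)."""
    return (filename_similarity(url_a, url_b) >= name_threshold
            and content_similarity(terms_a, terms_b, lexicon) >= content_threshold)
```

For example, under these assumptions the pair `http://example.org/en/about_e.html` and `http://example.org/tc/about_c.html` normalizes to the same filename, so the pair would pass the first module and proceed to the content check.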
