J. Chung, R. Chau, and C.-H. Yeh (Australia)
Web applications, parallel text alignment, knowledge discovery, Web mining system
A parallel corpus is a rich linguistic resource for multilingual text management tasks, including cross-lingual text retrieval, multilingual computational linguistics, and multilingual text mining. Constructing a parallel corpus requires effective alignment of parallel documents, that is, documents that are translated versions of the same text. In this paper, we develop a parallel page identification system that identifies and aligns parallel documents in the Web environment. The system uses a Web spider to crawl the Web and fetch potentially parallel multilingual Web documents. Two modules then determine whether a candidate document pair is parallel. First, a filename comparison module checks how closely the filenames resemble each other. Second, a content analysis module measures the semantic similarity of the documents. A preliminary experiment conducted on a Hong Kong government Web site indicates that the system is effective.
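The filename comparison step can be illustrated with a minimal sketch. This is not the authors' implementation: the list of language markers, the function names, and the use of Python's `difflib.SequenceMatcher` as the resemblance measure are all assumptions made for illustration, and the URLs are made up.

```python
import re
from difflib import SequenceMatcher
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical language markers; a real site may use different conventions.
LANGUAGE_MARKERS = ("english", "chinese", "eng", "chi", "en", "tc", "sc", "e", "c")


def normalize_filename(url: str) -> str:
    """Return the lowercased filename stem with a trailing language marker removed."""
    stem = PurePosixPath(urlparse(url).path).stem.lower()
    for marker in LANGUAGE_MARKERS:
        # Strip a marker only when it ends the stem and follows '_' or '-'.
        stripped = re.sub(rf"[_-]{marker}$", "", stem)
        if stripped != stem:
            return stripped
    return stem


def filename_resemblance(url_a: str, url_b: str) -> float:
    """Score two URLs in [0, 1] by the similarity of their normalized filenames."""
    name_a = normalize_filename(url_a)
    name_b = normalize_filename(url_b)
    return SequenceMatcher(None, name_a, name_b).ratio()


if __name__ == "__main__":
    pair = (
        "http://example.org/en/report_e.html",
        "http://example.org/tc/report_c.html",
    )
    print(filename_resemblance(*pair))
```

Under these assumptions, a pair whose filenames differ only in the language marker receives the maximum score and would be passed on to content analysis, while unrelated filenames score low. The threshold for accepting a pair is left open here because the abstract does not state one.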