Language Identification in Document Analysis (LIDA)

K. Jambi, M. Saleh, and H. Al-Barhamtooshi (Saudi Arabia)

Keywords

Document Analysis, AI, Agent Technology.

Abstract

This paper presents a technique for discriminating between text written in Arabic script and text written in Latin script. The technique addresses the language identification problem at both the word level and the text-line level, using an algorithm based on horizontal projection profiles. The result is a new language identification algorithm for determining the languages of a document. This approach may be used in many applications, including encoding of document pages, language-specific web crawling, information retrieval, natural language processing, text mining, translation service bureau software, spell checking software, stemming or morphological analyzers, and knowledge management systems.
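
To illustrate what a horizontal projection profile is, the following minimal Python sketch counts ink pixels per row of a binarized text-line image and then counts the dominant peaks in that profile. It is not the algorithm of this paper, whose details the abstract does not give. The function names, the 0.7 threshold ratio, the toy images, and the decision rule (one dominant peak suggesting a single baseline as in Arabic script, two suggesting the x-height line and baseline of Latin script) are all assumptions made for illustration.

```python
def horizontal_projection(image):
    """Return the number of ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def count_major_peaks(profile, ratio=0.7):
    """Count separate runs of rows whose ink count is >= ratio * max(profile).

    The ratio is a hypothetical tuning parameter, not a value from the paper.
    """
    if not profile or max(profile) == 0:
        return 0
    threshold = ratio * max(profile)
    peaks = 0
    in_peak = False
    for value in profile:
        if value >= threshold:
            if not in_peak:
                peaks += 1  # a new run of high-ink rows starts here
            in_peak = True
        else:
            in_peak = False
    return peaks


def guess_script(image, ratio=0.7):
    """Assumed heuristic: one dominant peak -> Arabic-like, two or more -> Latin-like."""
    peaks = count_major_peaks(horizontal_projection(image), ratio)
    if peaks == 1:
        return "arabic-like"
    if peaks >= 2:
        return "latin-like"
    return "unknown"


# Toy binary "text lines" (1 = ink, 0 = background), invented for illustration.
single_peak_line = [
    [0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],  # one dense baseline row
    [0, 0, 1, 0, 0, 0],
]
double_peak_line = [
    [0, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],  # dense row near the top of the letter bodies
    [0, 1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],  # dense baseline row
    [0, 0, 0, 1, 0, 0],
]

if __name__ == "__main__":
    print(horizontal_projection(single_peak_line), guess_script(single_peak_line))
    print(horizontal_projection(double_peak_line), guess_script(double_peak_line))
```

The same profile can be computed on a single word image or on a whole text line, which matches the two levels named in the abstract. A real system would also need binarization, line and word segmentation, and a more robust decision rule than this peak count.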
