Extraction of Arabic Words from Multilingual Documents

I. Moalla, A.M. Alimi, and A. Benhamadou (Tunisia)

Keywords

Heterogeneous blocks, Document analysis, Script discrimination, Word level.

Abstract

Latin script words are now commonly used in Arabic script documents. An OCR system developed for the Arabic script will misrecognize words written in Latin script, so these words must be filtered out before the Arabic script words are fed to the Arabic OCR. This creates the need for an automatic script recognition system that operates on words in Arabic and Latin scripts. In this paper we present a method that filters Latin words out of heterogeneous blocks. The method is based on a rapid filtering process that uses morphological and statistical features of the Arabic script, such as overlapping and inclusion of bounding boxes, the horizontal bar, low diacritics, and the height and width variation of connected components. In our tests, the method proved effective at discriminating between Arabic and Latin script at the word level. The data set of words was extracted from the "Directory of North of Africa", and the word identification rate reaches 98% on 1435 words.
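
To make the bounding-box features more concrete, the following is a minimal sketch of how overlap, inclusion, and size variation could be computed for the connected components of one word. It is not the authors' implementation: the abstract does not define these features precisely, so the `Box` type, the function names, the inclusive pixel coordinates, and the use of max minus min as the "variation" measure are all assumptions made for illustration. No decision thresholds are shown because the abstract gives none.

```python
from itertools import combinations
from typing import Dict, List, NamedTuple


class Box(NamedTuple):
    """Hypothetical bounding box of one connected component (inclusive pixels)."""
    x0: int
    y0: int
    x1: int
    y1: int


def x_overlap(a: Box, b: Box) -> bool:
    """True when the horizontal extents of a and b share at least one column."""
    return a.x0 <= b.x1 and b.x0 <= a.x1


def contains(outer: Box, inner: Box) -> bool:
    """True when inner lies entirely inside outer."""
    return (outer.x0 <= inner.x0 and outer.y0 <= inner.y0
            and outer.x1 >= inner.x1 and outer.y1 >= inner.y1)


def box_features(boxes: List[Box]) -> Dict[str, int]:
    """Count pairwise inclusions and overlaps, and measure size variation.

    Assumption: a pair counts as an inclusion if either box contains the
    other; otherwise it counts as an overlap if the horizontal extents meet.
    """
    if not boxes:
        return {"overlaps": 0, "inclusions": 0,
                "height_range": 0, "width_range": 0}

    overlaps = 0
    inclusions = 0
    for a, b in combinations(boxes, 2):
        if contains(a, b) or contains(b, a):
            inclusions += 1
        elif x_overlap(a, b):
            overlaps += 1

    heights = [b.y1 - b.y0 + 1 for b in boxes]
    widths = [b.x1 - b.x0 + 1 for b in boxes]
    return {
        "overlaps": overlaps,
        "inclusions": inclusions,
        # Assumed variation measure: spread between largest and smallest.
        "height_range": max(heights) - min(heights),
        "width_range": max(widths) - min(widths),
    }


# Made-up component boxes for a single word image.
example = [Box(0, 0, 10, 10), Box(2, 2, 4, 4),
           Box(8, 12, 14, 14), Box(20, 0, 25, 10)]
features = box_features(example)
```

In this made-up example, the second box lies inside the first (one inclusion) and the third box shares columns with the first without being contained in it (one overlap). A classifier built on such counts would still need the remaining features named in the abstract, such as the horizontal bar and low diacritics, which are not sketched here.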
