Bayesian Learning of 2D Document Layout Models for Preservation Metadata Extraction

S. Mao and G.R. Thoma (USA)


2D document layout models, preservation metadata extraction, Bayesian learning.


Digital preservation addresses the storage, maintenance, accessibility, and technical integrity of digital materials over the long term. Preservation metadata is the information required to perform these tasks. Given the large volume of journals to be preserved and the high labor cost of manual metadata entry, automated metadata extraction is necessary. Document layout analysis is the process of partitioning document images into hierarchically structured and labeled homogeneous physical regions. Descriptive metadata such as bibliographic information can then be extracted from these segmented and labeled regions using OCR. While numerous algorithms have been proposed for document layout analysis, most of them require manually specified rules or models. In this paper, we first define the hierarchical 2D layout model of document pages as a set of attributed hidden semi-Markov models (HSMMs). Each attributed HSMM represents the projection profile of the character bounding boxes in a physical region on either the X or Y axis. We then describe a Bayesian-based method to learn 2D layout models from the unstructured and labeled physical regions in a set of training pages. We compare the zoning and labeling performance of the learned HSMM-based model, a learned baseline model, and two rule-based systems on 69 test pages, and show that the HSMM-based model has the best overall performance and comparable or better performance for individual fields.
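To make the notion of a projection profile concrete, the following is a minimal sketch and not the authors' implementation. It assumes character bounding boxes are given as hypothetical (x0, y0, x1, y1) tuples in half-open pixel coordinates, projects them onto one axis, and then splits the profile into runs of covered and empty positions. The run lengths are the kind of duration information that a semi-Markov model represents explicitly; the function names and the run-length encoding are illustrative assumptions.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1), half-open pixel coordinates.
Box = Tuple[int, int, int, int]


def projection_profile(boxes: List[Box], axis: str, length: int) -> List[int]:
    """Count how many boxes cover each pixel position along the given axis."""
    if axis not in ("x", "y"):
        raise ValueError("axis must be 'x' or 'y'")
    profile = [0] * length
    for x0, y0, x1, y1 in boxes:
        lo, hi = (x0, x1) if axis == "x" else (y0, y1)
        # Clip the box extent to the profile bounds before counting.
        for p in range(max(lo, 0), min(hi, length)):
            profile[p] += 1
    return profile


def profile_runs(profile: List[int]) -> List[Tuple[bool, int]]:
    """Split a profile into (is_covered, duration) runs.

    Assumption: a position counts as 'covered' when at least one box
    projects onto it; consecutive positions with the same status are merged.
    """
    runs: List[Tuple[bool, int]] = []
    for value in profile:
        covered = value > 0
        if runs and runs[-1][0] == covered:
            runs[-1] = (covered, runs[-1][1] + 1)
        else:
            runs.append((covered, 1))
    return runs


if __name__ == "__main__":
    # Two boxes on one text line, one wide box on a second line.
    boxes = [(0, 0, 4, 2), (5, 0, 9, 2), (0, 4, 9, 6)]
    y_profile = projection_profile(boxes, "y", 8)
    print(y_profile)
    print(profile_runs(y_profile))
```

On the Y axis, the alternating covered and empty runs correspond to text lines and the gaps between them; on the X axis the same encoding would separate characters or columns, depending on the region being projected.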
