LAMP Seminar
Language and Media Processing Laboratory
Conference Room 4406
A.V. Williams Building
University of Maryland

Friday, Jan 24, 3 PM
A Statistical Methodology for Validating Document Image Degradation Models

Tapas Kanungo
Caere Corporation


ABSTRACT

Models for document image degradations are crucial in many ways. Models allow us to (i) conduct controlled experiments to study the break-down points of systems, (ii) create large data sets with groundtruth for training classifiers, (iii) design optimal noise removal algorithms, (iv) choose values for the free parameters of the algorithms, etc. Although a few degradation models have been proposed in the literature, none of them has been validated against real-world degradations. The stumbling block has been that it is difficult to map the problem "do these simulated, degraded images look similar to the real, scanned images" into a quantitative hypothesis-testing problem.
In this talk I will describe a statistical methodology that can be used to validate document image degradation models. This method is based on a non-parametric, two-sample permutation test. Another standard statistical device -- the power function -- is then used to choose between validation algorithm variables (such as distance functions). Since the validation and power function procedures are independent of the model, they can be used to validate any other degradation model. Finally, I will describe a method for comparing any two models. It uses p-values associated with the estimated models to select the model that is closer to the real world.
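As a rough illustration of the general idea of a two-sample permutation test (not the speaker's actual procedure), the sketch below compares a sample of feature values measured on real scans with a sample measured on simulated images. The function names, the scalar features, and the mean-difference distance are all assumptions made for this example; the abstract does not specify which distance functions the talk uses.

```python
import random


def mean_distance(xs, ys):
    """Hypothetical distance: absolute difference of the sample means."""
    return abs(sum(xs) / len(xs) - sum(ys) / len(ys))


def permutation_test(real, simulated, distance=mean_distance,
                     n_perm=2000, seed=0):
    """Estimate a p-value for the null hypothesis that both samples
    come from the same distribution.

    The two samples are pooled and randomly re-split n_perm times; the
    p-value is the fraction of re-splits whose distance is at least as
    large as the observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = distance(real, simulated)
    pooled = list(real) + list(simulated)
    n = len(real)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if distance(pooled[:n], pooled[n:]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Made-up feature values, for illustration only.
    real = [0.1 * i for i in range(20)]
    simulated = [10.0 + 0.1 * i for i in range(20)]
    print(permutation_test(real, simulated))
```

A small p-value suggests the simulated sample is distinguishable from the real one under the chosen distance, so the model (or its parameter setting) would be rejected; a large p-value means the test found no evidence of a difference.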

(This work was done in collaboration with Professors R. M. Haralick (EE,UW), Werner Stuetzle (Stat,UW), David Madigan (Stat,UW), and Dr. Henry Baird (Bell Labs).)




© Copyright 2001, Language and Media Processing Laboratory, University of Maryland, All rights reserved.