Models
for document image degradations are crucial in many ways. Models
allow us to (i) conduct controlled experiments to study the break-down
points of systems, (ii) create large data sets with ground truth
for training classifiers, (iii) design optimal noise removal algorithms,
(iv) choose values for the free parameters of the algorithms,
etc. Although a few degradation models have been proposed in the
literature, none of them have been validated against real-world
degradations. The stumbling block has been that it is difficult
to map the problem "do these simulated, degraded images look
similar to the real, scanned images" into a quantitative
hypothesis-testing problem.
In this talk I will describe a statistical methodology that can
be used to validate document image degradation models. This method
is based on a non-parametric, two-sample permutation test. Another
standard statistical device -- the power function -- is then used
to choose between validation algorithm variables (such as distance
functions). Since the validation and power function procedures
are independent of the model, they can be used to validate any
other degradation model. Finally, I will describe a method for
comparing any two models. It uses p-values associated with the
estimated models to select the model that is closer to the real
world.
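
As a rough illustration of the two statistical devices named above, the
following Python sketch shows a generic two-sample permutation test and a
Monte Carlo estimate of its power. It is not the procedure from the talk:
the function names, the mean-difference distance, and the use of scalar
Gaussian "features" in place of measurements taken from real and simulated
images are all assumptions made for this example.

```python
import random


def mean_difference(xs, ys):
    """Hypothetical distance function: absolute difference of sample means."""
    return abs(sum(xs) / len(xs) - sum(ys) / len(ys))


def permutation_test(real, simulated, distance=mean_difference,
                     n_perm=999, seed=0):
    """Return a p-value for H0: both samples come from the same distribution.

    `real` and `simulated` stand in for feature values measured on scanned
    and model-generated images (an assumption of this sketch).
    """
    rng = random.Random(seed)
    observed = distance(real, simulated)
    pooled = list(real) + list(simulated)
    n = len(real)
    count = 0
    for _ in range(n_perm):
        # Under H0 the group labels are exchangeable, so reshuffle them.
        rng.shuffle(pooled)
        if distance(pooled[:n], pooled[n:]) >= observed:
            count += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (count + 1) / (n_perm + 1)


def estimated_power(shift, alpha=0.05, n=30, trials=100, seed=1):
    """Estimate the rejection rate when the simulated sample is shifted.

    Plotting this value against `shift` gives an empirical power function
    for the chosen distance; `shift` is a stand-in for model mismatch.
    """
    rng = random.Random(seed)
    rejections = 0
    for t in range(trials):
        real = [rng.gauss(0.0, 1.0) for _ in range(n)]
        simulated = [rng.gauss(shift, 1.0) for _ in range(n)]
        if permutation_test(real, simulated, n_perm=199, seed=t) <= alpha:
            rejections += 1
    return rejections / trials
```

Under these assumptions, two candidate distance functions could be compared
by evaluating estimated_power over a range of shift values and preferring
the one whose rejection rate rises faster, and two candidate models could be
compared by the p-values that permutation_test assigns to each.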
(This
work was done in collaboration with Professors R. M. Haralick
(EE,UW), Werner Stuetzle (Stat,UW), David Madigan (Stat,UW), and
Dr. Henry Baird (Bell Labs).)