Overview
Classifying zones into syntactic categories such as text,
graphics, and logos is an important subtask of any generic OCR
system. Automated techniques for training zone classifiers are
crucial because (a) test datasets keep changing, and an automated
algorithm can be adapted to a new dataset simply by retraining;
(b) the resulting classifier is not governed by the subjective
bias of an individual; and (c) the methods are generic and can be
applied to any classification problem.
A decision-tree-based classifier has been implemented and tested
on the University of Washington (UW) dataset. The classifier achieves
96% accuracy and makes approximately 33% fewer misclassification
errors than the University of Washington algorithm.
- Feature Extraction: Software has been written to extract features
based on connected components. These features include the mean and
standard deviation of component height, width, area, and aspect
ratio; the number of connected components; and the percentage of
area covered by connected components.
- Classifier: A CART-based decision tree was trained on the
University of Washington dataset.
- Evaluation: Training and testing used 10-fold cross-validation:
the dataset was divided into 10 mutually exclusive subsets, with
9 used for training and 1 for testing, and the test subset was
rotated so that each subset served as the test set once.
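
The connected-component features listed under Feature Extraction can be illustrated with a small sketch. This is not the project's software: it assumes a zone is given as a binary grid (1 = foreground pixel), uses 4-connectivity, and measures each component by its bounding box, none of which the text specifies. The names `connected_components` and `zone_features` are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


def connected_components(grid):
    """Return bounding boxes (top, left, bottom, right) of 4-connected foreground regions."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < rows and 0 <= nx < cols and grid[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def zone_features(grid):
    """Compute the feature set named in the text for one zone (assumption: bounding-box measurements)."""
    boxes = connected_components(grid)
    heights = [b - t + 1 for t, l, b, r in boxes]
    widths = [r - l + 1 for t, l, b, r in boxes]
    areas = [h * w for h, w in zip(heights, widths)]
    aspects = [w / h for h, w in zip(heights, widths)]
    features = {"num_components": len(boxes)}
    for name, values in (("height", heights), ("width", widths),
                         ("area", areas), ("aspect_ratio", aspects)):
        # Population standard deviation; an empty zone yields zeros.
        features[f"mean_{name}"] = mean(values) if values else 0.0
        features[f"std_{name}"] = pstdev(values) if values else 0.0
    # Assumption: coverage is the summed bounding-box area over the zone area.
    features["area_coverage_pct"] = 100.0 * sum(areas) / (rows_times_cols(grid))
    return features


def rows_times_cols(grid):
    """Total number of pixels in the zone."""
    return len(grid) * len(grid[0])


zone = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
]
features = zone_features(zone)
```

Each zone then yields a fixed-length feature vector (ten values here) that a classifier can consume. Because coverage is computed from bounding boxes, it can exceed 100% when boxes overlap; counting foreground pixels instead would be an equally plausible reading of the text.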
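
The rotation scheme described under Evaluation (10 mutually exclusive subsets, train on 9, test on 1) can be sketched as follows. The helper names `k_fold_splits` and `cross_validate` are hypothetical, the interleaved assignment of samples to subsets is an assumption (the text does not say how the subsets were drawn), and the majority-class "classifier" is only a placeholder standing in for the CART-based decision tree.

```python
from collections import Counter


def k_fold_splits(n_samples, k=10):
    """Yield (train_indices, test_indices) for k mutually exclusive subsets."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    # Assumption: samples are dealt to subsets in interleaved order.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(samples, labels, fit, predict, k=10):
    """Train on k-1 subsets, test on the remaining one, rotate, and return overall accuracy."""
    correct = 0
    for train, test in k_fold_splits(len(samples), k):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(model, samples[i]) == labels[i] for i in test)
    return correct / len(samples)


# Placeholder classifier: always predicts the most common training label.
def fit_majority(train_samples, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, sample):
    return model


# Toy data: 20 zones with dummy feature vectors, 16 "text" and 4 "graphics".
samples = [[float(i)] for i in range(20)]
labels = ["text"] * 16 + ["graphics"] * 4
accuracy = cross_validate(samples, labels, fit_majority, predict_majority, k=10)
```

Every sample is used for testing exactly once, so the reported accuracy is computed over the whole dataset while each prediction comes from a model that never saw that sample during training.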