Overview
------------
The JargonDOCLIB add-on implements a script identification algorithm for machine text. The determination is made directly from the image, prior to OCR. Knowing the script of an image can be useful for filtering text images and routing them to follow-on processing. JARGON also attempts to classify the orientation of the image; however, because this classification is highly biased toward the upright direction, it may be better to use a different technique for orientation.
While the JargonDOCLIB add-on contains all of the software needed to retrain the algorithm on desired scripts, this package includes two pretrained sets of data files. One set allows for script identification among 13 scripts (Amharic, Arabic, Armenian, Burmese, Chinese, Cyrillic, Devanagari, Greek, Hebrew, Japanese, Korean, Latin, and Thai). The other discriminates between Latin and Arabic scripts in all four directions. In this case the JARGON script orientation classification capability is not used; rather, the non-upright Latin and Arabic scripts are treated as separate scripts during training.
JARGON uses a template matching technique to discriminate between scripts. JARGON also includes an "unknown" classification. The template matching technique that serves as the basis for JARGON was initially developed at Los Alamos National Laboratory (see "Automatic Script Identification from Images using Cluster-based Templates", Proc. 3rd Int'l Conf. on Document Analysis and Recognition, pp. 378-381, Montreal, Canada, August 14-16, 1995).
There are three major components (programs) currently used to train and then run script identification. First, the JargonTrainingDriver program is used to cluster the connected components occurring within a training set of images. The connected components from each training image for a script are extracted and rescaled to a common size, currently 20x20. Each component is then compared to the clusters already formed. If it is significantly different from the existing clusters, it becomes the basis for a new cluster. Otherwise it is added to an existing cluster. The metric used to measure similarity is the Hamming distance. Currently, for 20x20 components, a Hamming distance threshold of 111 is used. After the clustering process is completed, the clusters (which are greyscale because of averaging) are thresholded at 50% into binary values. This process is repeated for each script.
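The clustering step described above can be sketched roughly as follows. This is a minimal illustration in Python, not the JargonTrainingDriver implementation. The names (hamming, Cluster, cluster_components) are hypothetical, components are assumed to be flat lists of 400 binary pixels, and several details are assumptions: each component is compared against the cluster's current 50%-thresholded average, it joins the closest cluster when the distance is at most 111, and a pixel that is set in exactly half of the members counts as set.

```python
SIZE = 20 * 20      # components are rescaled to 20x20 before clustering
THRESHOLD = 111     # Hamming distance threshold quoted in the text


def hamming(a, b):
    """Count the pixel positions at which two binary images differ."""
    return sum(x != y for x, y in zip(a, b))


class Cluster:
    """Running per-pixel sum of the components assigned to one cluster."""

    def __init__(self, component):
        self.sums = list(component)
        self.count = 1

    def add(self, component):
        for i, pixel in enumerate(component):
            self.sums[i] += pixel
        self.count += 1

    def template(self):
        # Threshold the averaged (greyscale) cluster at 50%.
        # Assumption: a pixel set in exactly half of the members counts as set.
        return [1 if 2 * s >= self.count else 0 for s in self.sums]


def cluster_components(components, threshold=THRESHOLD):
    """Cluster one script's components and return the binary templates."""
    clusters = []
    for comp in components:
        best, best_dist = None, None
        for cluster in clusters:
            dist = hamming(comp, cluster.template())
            if best_dist is None or dist < best_dist:
                best, best_dist = cluster, dist
        if best is None or best_dist > threshold:
            clusters.append(Cluster(comp))   # too different: start a new cluster
        else:
            best.add(comp)                   # close enough: merge into the closest
    return [cluster.template() for cluster in clusters]
```

In this sketch the resulting clusters depend on the order in which components are presented, and the function would be called once per script.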
The second part of training, performed by the JargonReliabilityDriver program, consists of assigning a metric to each cluster in each of the scripts. This metric is referred to as the cluster's "reliability" number, and it serves as a measure of how unique a symbol is to a script. To compute it, the best matching cluster is found for each connected component in all the training images for all the scripts. The reliability number for a cluster is then defined as the number of symbols best matching the cluster that belong to that cluster's script, divided by the total number of symbols best matching the cluster (matches to the cluster that are of the cluster's script / total matches).
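The reliability ratio defined above can be illustrated with a short Python sketch. It is not the JargonReliabilityDriver code. The function name and data layout are hypothetical, and two details are assumptions: a cluster that no training symbol matches gets a reliability of 0.0, and ties between equally close clusters go to whichever cluster is listed first.

```python
def hamming(a, b):
    """Count the pixel positions at which two binary images differ."""
    return sum(x != y for x, y in zip(a, b))


def reliability(templates, training):
    """Compute a reliability number for every cluster.

    templates: {script: [binary template, ...]}
    training:  {script: [binary component, ...]}
    Returns {(script, cluster_index): reliability}.
    """
    own = {(s, i): 0 for s, ts in templates.items() for i in range(len(ts))}
    total = dict(own)
    for comp_script, comps in training.items():
        for comp in comps:
            # Best matching cluster across all scripts.
            key = min(own, key=lambda k: hamming(comp, templates[k[0]][k[1]]))
            total[key] += 1
            if key[0] == comp_script:
                own[key] += 1
    # Assumption: clusters that attract no matches are given 0.0.
    return {k: (own[k] / total[k] if total[k] else 0.0) for k in own}
```

For example, a cluster that is the best match for two Latin symbols and two Arabic symbols would receive a reliability of 0.5 under this definition, regardless of which script it belongs to.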
Once the first two programs have been run to generate the data files necessary to perform script identification, the JargonScriptIDDriver can be used to perform script identification for an unknown image. This program randomly selects a set of connected components from an image. The components are filtered based on density, location on the page, and size. For each of the remaining symbols, the best match from among all the clusters for each script is computed. During this process a weighted sum is computed for each of the scripts. For each best fit computed, the sum for the corresponding script is incremented by the reliability number of the best fit cluster. Once all of the selected symbols from the image being classified have been processed, the distribution of the scores across scripts is used to determine the script classification of the image. In general the script with the highest score is chosen. If, however, no one script scores strongly (the standard deviation across scripts is less than 1.0), the image may be classified as unknown. At this point 200 symbols are used to classify an image. This is a configurable parameter. To select an orientation for an image, JARGON computes the script scores for all four directions for the first 50 components. If one direction shows a substantially higher standard deviation, indicating that one or two scripts are clearly beginning to emerge as predominant, that direction is selected. Otherwise JARGON assumes the image is upright.
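The scoring and "unknown" decision described above might look roughly like the following Python sketch. It is not the JargonScriptIDDriver implementation, and the names are hypothetical. It reflects one reading of the text: each selected symbol is matched against the clusters of all scripts, and the reliability of the single best cluster is added to that cluster's script. The use of the population standard deviation is also an assumption, and the density, location, and size filters and the orientation step are omitted.

```python
import random
import statistics


def hamming(a, b):
    """Count the pixel positions at which two binary images differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(components, templates, reliabilities,
             n_symbols=200, min_std=1.0, rng=None):
    """Return (label, scores) for one image.

    components:    list of binary components extracted from the image
    templates:     {script: [binary template, ...]}
    reliabilities: {(script, cluster_index): reliability}
    """
    rng = rng or random.Random()
    # Randomly select up to n_symbols components (200 in the text).
    chosen = rng.sample(components, min(n_symbols, len(components)))
    keys = [(s, i) for s, ts in templates.items() for i in range(len(ts))]
    scores = {script: 0.0 for script in templates}
    for comp in chosen:
        script, idx = min(keys, key=lambda k: hamming(comp, templates[k[0]][k[1]]))
        scores[script] += reliabilities[(script, idx)]
    # Assumption: population standard deviation across the script scores.
    if statistics.pstdev(list(scores.values())) < min_std:
        return "unknown", scores
    return max(scores, key=scores.get), scores
```

With only a handful of symbols the scores stay close together and the sketch returns "unknown", which mirrors the idea that no script has yet scored strongly.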
Input Image Format
-------------------
This algorithm operates in the binary domain; however, the software will automatically use the DOCLIB function DLBitsPerPixelConverter::DLBitsPerPixelConverter::dlConvertImage to convert color or grayscale images to binary. At this point there is no way to specify or select an alternate color/grayscale-to-binary conversion. If another method is desired, the images should be converted prior to providing them to the DLLogoTokenMatchDOCLIB functions.
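As an illustration of converting an image beforehand, the following Python sketch applies a fixed global threshold to 8-bit grayscale pixel rows. It is not the method used by the DOCLIB converter and does not call any DOCLIB function. The function name, the default threshold of 128, and the polarity (1 for dark pixels) are all assumptions chosen for the example.

```python
def binarize(gray_rows, threshold=128):
    """Map 8-bit grayscale rows to binary rows.

    Assumption: 1 marks a dark (ink) pixel and 0 a light (background) pixel.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray_rows]
```

A different binarization method could be substituted here before the images are passed on.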