LAMP - Media Group

SYSTEM OVERVIEW

Functional

The BRIDGE system uses four different processes to generate XML output from original scanned dictionary images.

1. Acquisition and OCR:- We unbind the paper dictionary and scan its pages as TIF images. To extract text from the scanned images, we perform OCR using the ScanSoft SDK.
2. Segmentation:- Segmentation is the process of dividing the image into different textual zones and identifying individual entries. The results of Segmentation are fed back to the system through "bootstrapping" to correct any errors.
3. Tagging:- The Tagging process classifies the parts of each dictionary entry into their corresponding linguistic categories, e.g. headword, translation, part of speech, etc.
4. Generation:- The Generation process produces the results in different formats, such as XML and HTML.
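
As a rough illustration of the Generation step, the sketch below turns tagged entries into XML. It is not the BRIDGE implementation (which uses C++ for its backbone functions); the function names, the dictionary-of-tags representation, and the element names `dictionary`, `entry`, `headword`, `pos` and `translation` are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def entry_to_xml(entry):
    """Build one <entry> element from a tagged entry.

    `entry` is a hypothetical representation: a dict mapping a tag name
    (e.g. "headword") to the text that the Tagging step assigned to it.
    """
    node = ET.Element("entry")
    for tag, text in entry.items():
        child = ET.SubElement(node, tag)
        child.text = text
    return node


def generate_xml(entries):
    """Wrap all entries in an assumed <dictionary> root and serialize to a string."""
    root = ET.Element("dictionary")
    for entry in entries:
        root.append(entry_to_xml(entry))
    return ET.tostring(root, encoding="unicode")


# A single made-up entry, as the Tagging step might have produced it.
sample = [{"headword": "casa", "pos": "n.", "translation": "house"}]
xml_text = generate_xml(sample)
print(xml_text)
```

Keeping the tagged entry as a plain mapping means the same data could be passed to a different writer (for example, one that emits HTML) without changing the earlier stages.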

Search and Retrieval:-
The Search and Retrieval utility is used to browse the electronically generated dictionary and look up entries.
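
A lookup of this kind can be sketched as a headword index built from the generated XML. This is only a minimal example under stated assumptions: the XML layout (`dictionary`, `entry`, `headword`, `pos`, `translation`) and the function names `build_index` and `look_up` are hypothetical and do not describe the actual BRIDGE utility.

```python
import xml.etree.ElementTree as ET

# Made-up sample in an assumed output layout; real BRIDGE output may differ.
SAMPLE_XML = (
    "<dictionary>"
    "<entry><headword>casa</headword><pos>n.</pos>"
    "<translation>house</translation></entry>"
    "<entry><headword>perro</headword><pos>n.</pos>"
    "<translation>dog</translation></entry>"
    "</dictionary>"
)


def build_index(xml_text):
    """Map each lower-cased headword to the list of entries that carry it."""
    root = ET.fromstring(xml_text)
    index = {}
    for entry in root.iter("entry"):
        headword = entry.findtext("headword")
        if headword is None:
            continue  # skip entries without a headword
        fields = {child.tag: (child.text or "") for child in entry}
        index.setdefault(headword.strip().lower(), []).append(fields)
    return index


def look_up(index, query):
    """Return all entries whose headword matches the query, ignoring case."""
    return index.get(query.strip().lower(), [])


index = build_index(SAMPLE_XML)
print(look_up(index, "Casa"))
```

Storing a list per headword allows for dictionaries in which the same headword appears in more than one entry.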

View a block diagram/flowchart of BRIDGE

Implementation

We used Java to design the GUI and C++ for the backbone functions. Different parts of the system were developed on Windows XP/2000 and UNIX. Currently, the entire system is preinstalled on a desktop PC running Windows. Compilation and execution are done through a batch file.

The accuracy of the output depends on the quality of the scanning, so the pages should be scanned at the best quality possible.

Note: In our environment, one language of the bilingual dictionary is English.

System Version: 0.1