Functional
The BRIDGE system uses four different processes to generate
XML output from original scanned dictionary images.
1. Acquisition and OCR:- We unbind the paper dictionary and
scan its pages as TIF images. To extract text from the
scanned images, we perform OCR using the ScanSoft SDK
software.
2. Segmentation:- Segmentation is the process of dividing
the image into different textual zones and identifying
individual entries. The results of Segmentation are fed back
to the system through "bootstrapping" to correct any
errors.
3. Tagging:- The Tagging process classifies each dictionary
entry into its corresponding linguistic parts, e.g.
headword, translation, part of speech, etc.
4. Generation:- The Generation process produces the results
in different formats, such as XML and HTML.
Search and Retrieval:-
The Search and Retrieval utility is used to browse the
electronically generated dictionary and look up entries.
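To make the flow from tagged entries to XML output and lookup more concrete, here is a minimal sketch in Python. It is not the BRIDGE code (which is written in Java and C++): the `Entry` class, the element names (`dictionary`, `entry`, `headword`, `pos`, `translation`), the `entries_to_xml` and `lookup` helpers, and the sample entries are all assumptions made up for illustration, and the actual BRIDGE XML schema may differ.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Entry:
    """One tagged dictionary entry (field names are illustrative assumptions)."""
    headword: str
    pos: str
    translation: str


def entries_to_xml(entries):
    """Serialize tagged entries to an XML string using a hypothetical schema."""
    root = ET.Element("dictionary")
    for e in entries:
        node = ET.SubElement(root, "entry")
        ET.SubElement(node, "headword").text = e.headword
        ET.SubElement(node, "pos").text = e.pos
        ET.SubElement(node, "translation").text = e.translation
    return ET.tostring(root, encoding="unicode")


def lookup(xml_text, headword):
    """Return the translations of all entries whose headword matches exactly."""
    root = ET.fromstring(xml_text)
    return [
        node.findtext("translation")
        for node in root.iter("entry")
        if node.findtext("headword") == headword
    ]


# Made-up sample data standing in for the output of the Tagging process.
entries = [Entry("casa", "n.", "house"), Entry("comer", "v.", "to eat")]
xml_text = entries_to_xml(entries)
print(xml_text)
print(lookup(xml_text, "casa"))
```

Keeping each linguistic part in its own element means a retrieval tool can match on the headword alone, without re-parsing the entry text.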
View a block diagram/flowchart of BRIDGE
Implementation
We used Java to design the GUI and C++ for the backbone
functions. Different parts of the system were developed on
Windows XP/2000 and on UNIX. Currently, the entire system is
preinstalled on a desktop PC running Windows. Compilation
and execution are done through a batch file.
The accuracy of the output depends on the quality of the
scanning, so the pages should be scanned at the best quality
possible.
Note: In our environment, one language of the bilingual
dictionary is English.
System Version: 0.1