Overview
The increasing availability of online digital imagery and video
has rekindled interest in the problems of how to index multimedia
information sources automatically and how to browse and manipulate
them efficiently. Traditionally, images and video sequences have
been manually annotated with a small number of keyword descriptors
entered after visual inspection by a human reviewer. Unfortunately,
the process can be very time-consuming and, although perhaps acceptable
for archiving applications, such delays will inhibit the ability
to perform near real-time filtering or retrieval. Furthermore,
the complexity of the scenes and the language used to describe them
make it difficult to claim to have generated a "complete" description
of an image. Although this is not necessarily detrimental to indexing,
the limitations should be considered and weighed against other
options.

We do find, however, that some information-rich sources such as
newscasts, commercials, and sporting events often contain text. The
ability to extract text from scene images or video can provide
important supplemental content information useful for indexing
and retrieval.

At a high level, text can be divided into two classes: scene text
and graphic text. Scene text is text that appears on objects
in an image of the scene. This includes, for example, writing on
signs or billboards, text on the sides of trucks, or even writing
on tee-shirts. Although valuable, the appearance of such text
is typically incidental to the scene content, and it would likely
be useful primarily in applications such as navigation, surveillance,
or reading text off known objects, rather than in general indexing
and retrieval.

Graphic text, on the other hand, is text that is mechanically added to
a video to supplement the visual and spoken information it contains.
For example, it may include anything from the time to a location
to the name of a correspondent in a newscast. The descriptors
are typically concise.

Graphic text has a number of functions, which differ between domains. In
commercials, text appears to reinforce vital information such
as the product name or claims made or, in some cases, to provide
disclaimers. In sporting events, text is used to identify specific
players, provide game information, or relay statistics. In
newscasts, graphic text can be used to identify key features
of the scene, such as the location (White House lawn) or the speaker
(Bill and Hillary), to provide a synopsis of the topic (Blizzard
1997), or to provide a visual summary of statistical information.
In movies and television shows, text provides production and acting
credits and, in other cases, captions or language translations.

Text in video is usually much noisier and at a lower resolution
than text in static scene images, which makes extraction difficult.
The methods we use are based on a scanning-window architecture
incorporated in a hybrid wavelet/neural network segmenter. Each
small window of a video image is examined and classified according
to whether or not it contains text.
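
The scanning-window idea can be illustrated with a minimal sketch. This is not the segmenter described above. It assumes a grayscale frame held in a NumPy array, uses a one-level Haar transform as a stand-in for the wavelet stage, and replaces the trained neural network with a simple energy threshold. The function names, window size, step, and threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def haar_detail_energy(window):
    """Mean energy of the three one-level Haar detail bands of a window.

    Assumption: a single-level Haar transform stands in for the wavelet
    stage; the wavelet actually used by the segmenter is not specified here.
    """
    w = np.asarray(window, dtype=float)
    if w.shape[0] % 2 or w.shape[1] % 2:
        raise ValueError("window sides must be even")
    # Split the window into the four pixels of every 2x2 block.
    a, b = w[0::2, 0::2], w[0::2, 1::2]
    c, d = w[1::2, 0::2], w[1::2, 1::2]
    dx = (a - b + c - d) / 2.0   # differences across columns
    dy = (a + b - c - d) / 2.0   # differences across rows
    dxy = (a - b - c + d) / 2.0  # diagonal differences
    return np.array([np.mean(dx ** 2), np.mean(dy ** 2), np.mean(dxy ** 2)])


def classify_window(features, threshold=100.0):
    """Hypothetical placeholder for the trained neural network.

    A real system would feed the features to a learned classifier; here a
    window is called "text" when its total detail energy exceeds a threshold.
    """
    return bool(np.sum(features) > threshold)


def scan_image(image, win=16, step=8, threshold=100.0):
    """Slide a win x win window over the image and label each position."""
    image = np.asarray(image)
    rows = (image.shape[0] - win) // step + 1
    cols = (image.shape[1] - win) // step + 1
    mask = np.zeros((rows, cols), dtype=bool)
    for i in range(rows):
        for j in range(cols):
            y, x = i * step, j * step
            patch = image[y:y + win, x:x + win]
            mask[i, j] = classify_window(haar_detail_energy(patch), threshold)
    return mask
```

Calling `scan_image(frame)` on a 2-D array returns a boolean grid with one entry per window position. Overlapping windows (a step smaller than the window size) are used so that text falling across a window boundary is still likely to be covered by at least one window.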