LAMP - Media Group - Research - Enhancement of Text in Video

Extraction of Text from Video

Overview

The increasing availability of online digital imagery and video has rekindled interest in the problems of how to index multimedia information sources automatically and how to browse and manipulate them efficiently. Traditionally, images and video sequences have been manually annotated with a small number of keyword descriptors entered after visual inspection by a human reviewer. Unfortunately, this process can be very time consuming and, although perhaps acceptable for archiving applications, such delays inhibit near real-time filtering or retrieval. Furthermore, the complexity of the scenes and of the language used to describe them makes it difficult to claim to have generated a "complete" description of an image. Although this is not necessarily detrimental to indexing, the limitations should be considered and weighed against other options.

We do find, however, that some information-rich sources such as newscasts, commercials, and sporting events often contain text. The ability to extract text from scene images or video can provide important supplemental content information useful for indexing and retrieval.

At a high level, text can be divided into two classes: scene text and graphic text. Scene text is text that appears on objects in an image of the scene. This includes, for example, writing on signs or billboards, text on the sides of trucks, or even writing on T-shirts. Although valuable, the appearance of such text is typically incidental to the scene content, and would likely be useful primarily in applications such as navigation, surveillance, or reading text off of known objects, rather than in general indexing and retrieval.

Graphic text, on the other hand, is text that is mechanically added to a video to supplement the visual and spoken information it contains. For example, it may include anything from the time to a location to the name of a correspondent in a newscast. The descriptors are typically concise.

Graphic text has a number of functions, which differ between domains. In commercials, text appears to reinforce vital information such as the product name or claims made or, in some cases, to provide disclaimers. In sporting events, text is used to identify specific players, provide game information, or relay statistics. In newscasts, graphic text can be used to identify key features of the scene, such as the location (White House lawn) or the speaker (Bill and Hillary), to provide a synopsis of the topic (Blizzard 1997), or to provide a visual summary of statistical information. In movies and television shows, text provides production and acting credits, and in other cases captions or language translations.

Text in video is usually much noisier and at a lower resolution than text in static scene images, which makes extraction difficult. The methods we use are based on a scanning window architecture incorporated in a hybrid wavelet/neural network segmenter. Each small window of a video image is examined and classified according to whether or not it contains text.
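
The scanning-window idea can be illustrated with a minimal sketch. This is not the group's segmenter: the one-level Haar detail energy is only a rough stand-in for the wavelet features, and a fixed threshold replaces the neural network classifier. The names `haar_detail_energy`, `looks_like_text`, and `scan_windows`, as well as the window size, step, and threshold value, are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]  # grayscale image: rows of pixel intensities


def haar_detail_energy(window: Image) -> float:
    """Mean squared one-level Haar detail coefficient over 2x2 blocks."""
    total, count = 0.0, 0
    for r in range(0, len(window) - 1, 2):
        for c in range(0, len(window[0]) - 1, 2):
            a, b = window[r][c], window[r][c + 1]
            d, e = window[r + 1][c], window[r + 1][c + 1]
            col_diff = (a - b + d - e) / 2.0  # responds to vertical edges
            row_diff = (a + b - d - e) / 2.0  # responds to horizontal edges
            diagonal = (a - b - d + e) / 2.0  # responds to diagonal detail
            total += col_diff ** 2 + row_diff ** 2 + diagonal ** 2
            count += 3
    return total / count if count else 0.0


def looks_like_text(window: Image, threshold: float = 100.0) -> bool:
    """Hypothetical stand-in classifier: a threshold on detail energy.

    The method described above uses a neural network at this step.
    """
    return haar_detail_energy(window) > threshold


def scan_windows(
    image: Image,
    size: int,
    step: int,
    is_text: Callable[[Image], bool],
) -> List[Tuple[int, int]]:
    """Return the (top, left) corner of every window classified as text."""
    hits = []
    rows, cols = len(image), len(image[0])
    for top in range(0, rows - size + 1, step):
        for left in range(0, cols - size + 1, step):
            window = [row[left:left + size] for row in image[top:top + size]]
            if is_text(window):
                hits.append((top, left))
    return hits


# Toy frame: flat background with a high-contrast patch in the top-left corner.
frame = [[0.0] * 8 for _ in range(8)]
for r in range(4):
    for c in range(4):
        frame[r][c] = 255.0 if (r + c) % 2 == 0 else 0.0

hits = scan_windows(frame, size=4, step=4, is_text=looks_like_text)
print(hits)  # [(0, 0)]
```

Passing the classifier in as a function keeps the scanning loop independent of how each window is judged, so a trained network could replace the threshold without changing `scan_windows`.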