Recognition of people in low-quality video
 


Overview

This research addresses the problem of recognizing people in video, specifically a collection of people participating in a video teleconference. We envision a situation in which a video conferencing facility in a building is monitored by a computer vision system. A total population of O(100) people uses the conference facility, typically in small groups whose memberships are stable over time. The goal is to use information about the appearance and movements of the people in the conference room at any time to recognize them against this small gallery.

We have set up a scenario in our conference room to collect sample data of a video conferencing environment. The layout of the room is shown in Figure 1. We have collected data from three groups of four people. Images are taken with a Sony Handycam VX700. The distance between the camera and the people is about 20 feet, so the faces are about 30x30 pixels, much smaller than in typical face recognition problems.

We envision constructing a script of the conference. The script will include general information for each individual, such as gaze direction and facial expression. The system will also detect and track specific activities such as speaking, taking notes, raising hands, and other gestures that may occur in a conferencing environment. Further efforts will be made to identify a set of activities frequently seen in a video conferencing environment. We shall rely on previous video segmentation research.

We are interested in addressing the challenges posed by low-resolution video data streams obtained through low bit-rate coders (e.g., JPEG). Our plan is to take advantage of multiple cues, such as the face, sex, clothing color, and group membership, to aid the recognition process. Depending on factors such as body pose and position in the room, some of these attributes will be harder to recover than others. We plan first to identify the attributes that are easily recognizable and use them to prune the search space for attributes that are harder to identify. A reasoning module will complete the recognition by weighing the probabilistic evidence from each cue.
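One simple way to realize such a reasoning module is naive-Bayes fusion: treat each cue as conditionally independent given the identity and multiply per-cue likelihoods into a posterior over the small gallery. The sketch below is an illustration under that independence assumption, not the module described above; the cue names and scores are hypothetical.

```python
import numpy as np

def fuse_cues(cue_likelihoods, prior=None):
    """Combine per-cue likelihoods P(cue | identity) into a posterior over
    gallery identities, assuming the cues are conditionally independent
    (naive-Bayes fusion; a sketch, not the lab's actual reasoning module)."""
    L = np.asarray(cue_likelihoods, dtype=float)   # shape: (n_cues, n_people)
    if prior is None:
        prior = np.full(L.shape[1], 1.0 / L.shape[1])  # uniform over gallery
    # Work in log space for numerical stability, then renormalize.
    log_post = np.log(prior) + np.log(L + 1e-12).sum(axis=0)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

# Hypothetical scores: 3 cues (face, clothing color, group membership), 4 people.
cues = [[0.6, 0.2, 0.1, 0.1],   # face match
        [0.5, 0.3, 0.1, 0.1],   # clothing color
        [0.4, 0.4, 0.1, 0.1]]   # group membership
posterior = fuse_cues(cues)
print(posterior.argmax())       # index of the most likely identity
```

Easy cues that concentrate the posterior on a few identities effectively prune the gallery before the harder attributes are evaluated.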

The first subproblem we are studying is the accuracy with which we can recognize the faces of the participants in video. We employ an approach based on eigenfaces, extended with robust estimation and tracking of the face. We start by performing a principal component analysis (PCA) of the individual faces to form a face database. We then use eigen tracking, as described in [1], to track the faces. The tracker can accommodate some degree of affine transformation, such as translation, rotation, and scaling, as well as occlusion of the targets. Experiments are being conducted to determine how much spatial resolution is needed, for both database construction and tracking/recognition, to achieve a given level of face recognition performance.
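The PCA step and the recognition-by-projection step can be sketched as follows. This is a minimal eigenface pipeline, assuming flattened grayscale face images of equal size; it omits the robust estimation and affine tracking from [1].

```python
import numpy as np

def build_eigenface_basis(faces, k):
    """faces: (n, h*w) array of flattened gallery images.
    Returns the mean face and the top-k eigenfaces, computed by SVD
    of the mean-centered data (standard PCA)."""
    mean = faces.mean(axis=0)
    U, S, Vt = np.linalg.svd(faces - mean, full_matrices=False)
    return mean, Vt[:k]            # each row of Vt is an eigenface

def project(face, mean, basis):
    """Project one flattened face into eigenface coefficient space."""
    return basis @ (face - mean)

def recognize(face, mean, basis, gallery_coeffs):
    """Nearest neighbor in coefficient space; returns gallery index."""
    c = project(face, mean, basis)
    dists = np.linalg.norm(gallery_coeffs - c, axis=1)
    return int(dists.argmin())
```

During tracking, each candidate face region would be projected onto this basis and matched against the gallery coefficients; the resolution experiments above vary the image size fed to both steps.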

We have collected data for three groups of four people each. About one minute of data was collected for each seating arrangement, totaling roughly 15 minutes of data altogether. We tested how well the eigen tracking program recognizes and tracks the targets (faces) in frontal pose as a function of database and test image resolution. Some initial results are illustrated in Figure 2, where we track a person's head in motion and recognize her over the video sequence. The results show that frontal faces can be tracked and recognized reasonably well.

In the previous experiment, we hand-initialized the region to be tracked. Currently, we are working to segment the faces automatically. We start by integrating the intensity difference between frames to find the approximate regions where the human subjects are located. After processing a sequence for a few seconds, the program is expected to segment the regions with intensity changes (Figure 3). Spatial and temporal smoothing is applied to reduce noise from the flicker of fluorescent lighting. Our next step is to incorporate skin color and/or room model information into the segmentation process.
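The integration-and-smoothing idea can be sketched as below: accumulate absolute inter-frame differences with an exponential decay (temporal smoothing), spatially smooth the accumulator, and threshold. The decay, threshold, and 3x3 box filter are illustrative assumptions, not the parameters used in the experiments.

```python
import numpy as np

def motion_mask(frames, thresh=10.0, decay=0.9):
    """Integrate absolute inter-frame intensity differences over time,
    smooth spatially, and threshold to find regions with persistent motion.
    decay < 1 gives temporal smoothing; the 3x3 box filter suppresses
    pixel noise such as fluorescent-light flicker."""
    acc = np.zeros_like(frames[0], dtype=float)
    prev = frames[0].astype(float)
    for f in frames[1:]:
        f = f.astype(float)
        acc = decay * acc + np.abs(f - prev)   # decaying difference sum
        prev = f
    # 3x3 box smoothing via shifted sums of an edge-padded copy.
    pad = np.pad(acc, 1, mode='edge')
    h, w = acc.shape
    smooth = sum(pad[i:i + h, j:j + w]
                 for i in range(3) for j in range(3)) / 9.0
    return smooth > thresh
```

The resulting boolean mask marks candidate subject regions; skin color or a room model could then be intersected with it to isolate faces.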







© Copyright 2001, Language and Media Processing Laboratory, University of Maryland, All rights reserved.