Overview
This research focuses on the problem of recognizing people in
video, specifically a collection of people participating in
a video teleconference. We envision a situation in which a video
conferencing facility in a building is monitored by a computer
vision system. A total population of O(100) people uses the conference
facility, typically in small groups whose memberships are stable
over time. The goal is to use information about the appearance
and movements of the people in the conference room at any time
to recognize them against this small gallery.
We
have set up a scenario in our conference room to collect sample
data of a video conferencing environment. The setup of the room
is shown in Figure 1. We have collected data from three groups
of four people. Images are taken with a Sony Handycam VX700.
The distance between the camera and the people is about 20 feet.
The size of the faces is about 30x30 pixels, much smaller than
in typical face recognition problems.
We
envision constructing a script of the conference. The script will
include general information for each individual, such as gaze
direction and facial expression. The system will also detect
and track specific activities such as speaking, taking notes,
raising hands, and other gestures that may occur in a conferencing
environment. Further efforts will be made to identify a set of
activities frequently seen in a video conferencing environment.
We shall rely on previous video segmentation research.
We
are interested in addressing the challenges posed by video data
streams obtained through low bit-rate coders (e.g., JPEG) with low
resolution. We plan to take advantage of multiple cues, such as
the face, gender, clothing color, and group membership, to aid
in the recognition process. Depending on factors such as body
pose, position in the room, etc., some of these attributes will
be harder to recover than others. We plan to first identify the
attributes that are easily recognizable and use them to prune
the search space for attributes that are harder to identify. A
reasoning module will complete the recognition by weighing the
probabilistic evidence from each cue.
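One simple way such a reasoning module could combine per-cue evidence is naive-Bayes fusion. The sketch below is only illustrative: the identities and the per-cue likelihoods are made-up values, not measured data, and the source does not commit to this particular fusion rule.

```python
import math

# Hypothetical per-cue likelihoods P(cue observation | identity) for a
# small gallery. All names and numbers are illustrative assumptions.
cue_likelihoods = {
    "face":          {"alice": 0.60, "bob": 0.25, "carol": 0.15},
    "clothes_color": {"alice": 0.50, "bob": 0.30, "carol": 0.20},
    "group":         {"alice": 0.70, "bob": 0.20, "carol": 0.10},
}

def combine_cues(likelihoods, prior=None):
    """Naive-Bayes fusion: sum per-cue log-likelihoods (assuming the cues
    are conditionally independent), then normalize to a posterior."""
    people = list(next(iter(likelihoods.values())).keys())
    log_post = {p: (math.log(prior[p]) if prior else 0.0) for p in people}
    for cue in likelihoods.values():
        for p in people:
            log_post[p] += math.log(cue[p])
    # Normalize in log space for numerical stability.
    m = max(log_post.values())
    unnorm = {p: math.exp(v - m) for p, v in log_post.items()}
    z = sum(unnorm.values())
    return {p: v / z for p, v in unnorm.items()}

posterior = combine_cues(cue_likelihoods)
best = max(posterior, key=posterior.get)
```

Cheaply recovered cues (e.g., clothing color) can also be applied first to prune the candidate set before the more expensive face comparison runs.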
The
first subproblem we are studying is the accuracy with which we
can recognize the faces of the participants in video. We employ
an approach based on eigenfaces, extended with robust estimation
and tracking of the face. We start by performing a principal
component analysis (PCA) of the individual faces to form a face
database. We then use eigentracking, as described in [1], to track
the faces. The tracker can accommodate some degree of affine
transformation, such as translation, rotation, and scaling, as
well as occlusion of the targets. Experiments are being conducted
to determine how much spatial resolution is needed, for both
database construction and tracking/recognition, to achieve a given
level of face recognition performance.
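The PCA step for building the face database can be sketched as follows. The random array stands in for cropped 30x30 face images; the number of faces and the number of retained components are illustrative assumptions, not the values used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for cropped 30x30 face images (the working resolution noted
# above); real data would come from the conference-room video.
n_faces, h, w = 12, 30, 30
faces = rng.random((n_faces, h * w))

# PCA: subtract the mean face, then find the principal directions of the
# centered data. SVD gives the eigenfaces as the right singular vectors.
mean_face = faces.mean(axis=0)
centered = faces - mean_face
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
k = 5
eigenfaces = Vt[:k]                       # (k, 900) face-subspace basis

# Each face is represented by its coefficients in this basis; recognition
# compares coefficients (or reconstruction error) against the database.
coeffs = centered @ eigenfaces.T          # (n_faces, k)
recon = coeffs @ eigenfaces + mean_face   # approximate reconstruction
```

Projecting a new face window onto the same basis yields a low-dimensional descriptor that can be matched against the stored coefficients.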
We
have collected data for three groups of four people each. About
one minute of data is collected for each seating arrangement,
totaling about 15 minutes of data altogether. We tested how well
the eigentracking program recognizes and tracks the targets
(faces) in frontal pose as a function of database and test image
resolution. Some initial results are illustrated in Figure 2,
where we track a person's head in motion and recognize her over
the video sequence. The results show that faces in frontal view
can be tracked and recognized reasonably well.
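The resolution experiment can be sketched by downsampling gallery faces and checking whether a noisy probe is still matched correctly at each resolution. Everything here is synthetic and illustrative; in practice the comparison is done with eigenface coefficients on real video frames, and nearest-neighbour matching in raw pixel space stands in for that.

```python
import numpy as np

rng = np.random.default_rng(1)

def downsample(img, factor):
    """Block-average downsampling; a simple stand-in for reduced
    camera/coder resolution."""
    h, w = img.shape
    img = img[:h - h % factor, :w - w % factor]
    return img.reshape(h // factor, factor, -1, factor).mean(axis=(1, 3))

# Synthetic stand-ins for a four-person gallery of 30x30 faces.
gallery = rng.random((4, 30, 30))
# Probe = one gallery face plus noise, mimicking a new frame of the
# same person.
probe = gallery[2] + 0.05 * rng.standard_normal((30, 30))

for factor in (1, 2, 3):
    g = np.stack([downsample(f, factor) for f in gallery])
    p = downsample(probe, factor)
    # Nearest neighbour by squared pixel distance.
    dists = ((g - p) ** 2).sum(axis=(1, 2))
    match = int(dists.argmin())
```

Sweeping the downsampling factor while recording the match rate gives the performance-versus-resolution curve the experiments are after.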
In
the previous experiment, we hand-initialized the region to be
tracked. Currently, we are trying to segment the faces
automatically. We start by integrating the intensity difference
between frames to find the approximate regions where the human
subjects are located. The program is expected to segment the
regions with intensity changes after processing a few seconds of
the sequence (Figure 3). Spatial and temporal smoothing is applied
to reduce noise from the fluctuation of fluorescent lighting. Our
next step is to combine skin color and/or room model information
into the segmentation process.
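A minimal sketch of this frame-differencing segmentation, using synthetic frames in place of the conference-room video: the moving bright patch, the noise level, and the box-filter size are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic sequence: static background plus a moving bright patch
# (a stand-in for a person moving in the conference room).
H, W, T = 48, 64, 10
background = rng.random((H, W))
frames = []
for t in range(T):
    f = background + 0.01 * rng.standard_normal((H, W))  # sensor noise
    f[20:30, 5 + 3 * t: 15 + 3 * t] += 0.8               # moving subject
    frames.append(f)

# Temporal integration: accumulate absolute frame-to-frame differences
# over the whole sequence.
motion = np.zeros((H, W))
for prev, cur in zip(frames, frames[1:]):
    motion += np.abs(cur - prev)

# Spatial smoothing (5x5 box filter) to suppress flicker, e.g. from
# fluorescent lighting, then threshold to obtain candidate regions.
kernel = 5
pad = kernel // 2
padded = np.pad(motion, pad, mode="edge")
smooth = np.zeros_like(motion)
for dy in range(kernel):
    for dx in range(kernel):
        smooth += padded[dy:dy + H, dx:dx + W]
smooth /= kernel * kernel

mask = smooth > 0.5 * smooth.max()
```

The resulting mask marks the region swept by the moving patch, leaving the static background unselected; skin-color or room-model cues would then refine these candidate regions.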