Overview
An important aspect of video analysis is the ability to represent
a video's high-level structure for classification, indexing and retrieval.
Such a representation should capture, for example, the classification
of individual shots, the transitions and interactions between shots,
and the classification of groups of shots based on activity.
We have previously developed a novel technique that reduces a
sequence of MPEG-encoded video frames, directly in the compressed
domain, to a trail of points in a low-dimensional space -- a VideoTrail.
We use the DC coefficients of each frame as features and apply
FastMap, a dimensionality-reduction technique, to map each frame
to a low-dimensional point.
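
As an illustration of this reduction step, the sketch below projects per-frame DC-coefficient vectors to k dimensions with the standard FastMap construction. The Euclidean distance on DC coefficients and the pivot-selection heuristic are assumptions made for the example, not necessarily the exact choices in our implementation.

```python
import numpy as np

def fastmap(frames, k, rng=np.random.default_rng(0)):
    """Project each frame's DC-coefficient vector to a k-dimensional point.

    `frames` is an (n_frames, n_dc_coeffs) array.  Distances start as
    Euclidean distances on the DC coefficients and are corrected after
    each projection, as in Faloutsos & Lin's FastMap.
    """
    n = len(frames)
    coords = np.zeros((n, k))

    def dist2(i, j, axis):
        # Squared distance in the residual space after `axis` projections.
        d2 = np.sum((frames[i] - frames[j]) ** 2)
        d2 -= np.sum((coords[i, :axis] - coords[j, :axis]) ** 2)
        return max(d2, 0.0)

    for axis in range(k):
        # Pivot heuristic: pick a point, take the farthest point from it,
        # then the point farthest from that one.
        a = rng.integers(n)
        b = max(range(n), key=lambda j: dist2(a, j, axis))
        a = max(range(n), key=lambda j: dist2(b, j, axis))
        d_ab2 = dist2(a, b, axis)
        if d_ab2 == 0.0:
            break  # all remaining residual distances are zero
        for i in range(n):
            # Project point i onto the line through the two pivots.
            coords[i, axis] = (dist2(a, i, axis) + d_ab2 - dist2(b, i, axis)) / (
                2.0 * np.sqrt(d_ab2)
            )
    return coords
```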
In
the low-dimensional space, we cluster frames, analyze transitions
between clusters and compute properties of the resulting trail
efficiently. By classifying portions of the trail as either stationary
or transitional, we are able to detect gradual edits between scenes.
We split a VideoTrail by identifying regions in the sequence of
points where the video is stable and cutting between them, which
provides a more robust analysis than traditional approaches
that examine only local changes between frames. By tracking the
interaction of clusters over time, we lay the groundwork for the
complete analysis and representation of the video's physical and
semantic structure.
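
A minimal sketch of how such a stationary/transitional labeling could be computed from a trail is given below. The sliding-window spread measure, the window size and the threshold are illustrative assumptions rather than our exact criteria.

```python
import numpy as np

def label_trail(points, window=15, spread_thresh=2.0):
    """Label each trail point 'stationary' or 'transitional'.

    A point is stationary if the trail points in a window around it stay
    close to their centroid (small spread).  Window size and threshold
    are illustrative values that would need to be tuned.
    """
    n = len(points)
    labels = []
    half = window // 2
    for i in range(n):
        w = points[max(0, i - half): min(n, i + half + 1)]
        spread = np.mean(np.linalg.norm(w - w.mean(axis=0), axis=1))
        labels.append("stationary" if spread < spread_thresh else "transitional")
    return labels

def candidate_edits(labels):
    """Return (start, end) index ranges of transitional runs, which are
    candidate gradual edits between stable segments."""
    edits, start = [], None
    for i, lab in enumerate(labels):
        if lab == "transitional" and start is None:
            start = i
        elif lab == "stationary" and start is not None:
            edits.append((start, i - 1))
            start = None
    if start is not None:
        edits.append((start, len(labels) - 1))
    return edits
```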
Recent progress in this area has been on the classification of
video trails, both into transitional and non-transitional classes
and into higher-level patterns. We have also begun investigating
how hidden Markov models (HMMs) can be used for this classification.
Video is a visual language. HMMs are among the most successful tools
in speech recognition, applied at levels ranging from phoneme recognition
to content classification to automatic translation.
Similarly,
in the analysis of a video stream, HMMs would be useful at several
levels:
- Distinguish
transitions (fades, dissolves) from scenes.
- Classify
story elements: dialogue sequences, close-ups, outdoor scenery.
- Classify
video clips: news, advertising, weather, sports shows, ...
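
One common way to use HMMs for this kind of classification is to train one model per class and assign a new trail segment to the class whose model explains it best. The sketch below follows that recipe using the hmmlearn library with Gaussian emissions; the library, the number of states and the emission model are assumptions made for illustration, not a description of our system.

```python
import numpy as np
from hmmlearn import hmm  # one possible off-the-shelf HMM library

def train_class_models(training_data, n_states=4):
    """Fit one Gaussian-emission HMM per class.

    `training_data` maps a class name (e.g. 'dissolve', 'pan', 'dialogue')
    to a list of trail segments, each an (n_frames, n_dims) array of
    VideoTrail points.  State count and emission model are illustrative.
    """
    models = {}
    for label, segments in training_data.items():
        X = np.concatenate(segments)          # stacked observations
        lengths = [len(s) for s in segments]  # per-segment lengths
        m = hmm.GaussianHMM(n_components=n_states,
                            covariance_type="diag", n_iter=50)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify(models, segment):
    """Label a new trail segment with the class whose HMM assigns it the
    highest log-likelihood."""
    return max(models, key=lambda label: models[label].score(segment))
```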
Compared
to an approach that tries to define explicit rules for what a dissolve
should look like as opposed to a camera pan, or for how a news clip
differs from a nature show, a statistical approach seems to
provide a more powerful and more flexible framework.
Filmmakers
describe the structure of films with specific terms, chosen to
name syntactic units in the story line. These syntactic units
could correspond to states in the state transition network of a
high-level HMM parser.
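
For instance, a purely hypothetical high-level state set and transition matrix might look like the following; the states and probabilities are illustrative only.

```python
import numpy as np

# Hypothetical "syntactic unit" states for a high-level parser; the
# transition matrix encodes how often one unit tends to follow another
# in a story line.  All values are illustrative, not measured.
states = ["establishing shot", "dialogue", "action", "transition"]
A = np.array([
    [0.10, 0.50, 0.30, 0.10],   # establishing shot -> ...
    [0.05, 0.60, 0.20, 0.15],   # dialogue -> ...
    [0.05, 0.30, 0.50, 0.15],   # action -> ...
    [0.40, 0.30, 0.30, 0.00],   # transition -> ...
])
assert np.allclose(A.sum(axis=1), 1.0)  # each row is a probability distribution
```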
Related
statistical methods can be used to answer the following question:
what are the most discriminative features for classifying N video
clips into n categories? Answering this question is important
because it will tell us what kinds of content detection to focus
on for maximum discrimination.
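
One standard way to approach this question, sketched below under the assumption that each clip is summarized by a fixed-length feature vector, is to rank features by a Fisher-style ratio of between-class to within-class variance; the specific features and the scoring criterion are illustrative choices, not a statement of our method.

```python
import numpy as np

def fisher_scores(features, labels):
    """Rank per-clip features by a Fisher-style discriminability score.

    `features` is an (n_clips, n_features) array of per-clip measurements
    (e.g. cut rate, average trail spread, motion energy -- hypothetical
    examples); `labels` gives each clip's category.  A higher score means
    the between-class variance is large relative to the within-class
    variance, i.e. the feature is more discriminative.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    overall_mean = features.mean(axis=0)
    between = np.zeros(features.shape[1])
    within = np.zeros(features.shape[1])
    for c in np.unique(labels):
        fc = features[labels == c]
        between += len(fc) * (fc.mean(axis=0) - overall_mean) ** 2
        within += ((fc - fc.mean(axis=0)) ** 2).sum(axis=0)
    return between / np.maximum(within, 1e-12)
```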