Upgrade & Performance Evaluation of Video Frame Ranking Toolbox
Comparative evaluation of ranking techniques against user-defined ground truth

Hypothesis

The hypothesis about user behavior is as follows. Typically, if a clip contains 10 shots and the storyboard can hold only 10 thumbnails, the user will select one frame per shot to make the summary. If the storyboard can hold 20 thumbnails, the user will still select one frame per shot for static shots (since two nearly identical frames provide no additional information), but will use more than one frame to summarize very dynamic shots, in which the content at the beginning of the shot differs from the content at the end. This matches the observed behavior of the automatic summaries produced by the FRT when color feature vectors are used. The hypothesis for the study is therefore that the FRT provides a good model of user summarization.

Comparative Analysis

The storyboards produced by users for a given video clip are taken as ground truth summaries for that clip. However, we expect different users to produce different results for the same clip. Therefore, an automatic summary for a given clip is compared to all the available ground truths, with a similarity measure produced for each; the mean and standard deviation of these measures are then computed. Comparing summarization algorithms consists of comparing these means and standard deviations over a set of video clips for which ground truth is available.

Similarity measure between a ground truth summary and an automatic summary

For the similarity measure, we adopt concepts developed for the performance evaluation of data retrieval from databases. A ground truth summary defines two classes of frames among the frames of a video:

Class A: frames that belong to the ground truth summary.
Class B: frames that do not belong to the ground truth summary.
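The comparison protocol above — scoring one automatic summary against every available user ground truth and aggregating mean and standard deviation — can be sketched as follows. This is a minimal illustration, not the toolbox's code; the function name is hypothetical, and the pairwise similarity measure is passed in (the f1 measure defined in the next section is the intended choice).

```python
from statistics import mean, stdev

def evaluate_summary(auto_summary, ground_truths, similarity):
    """Score one automatic summary against every user ground truth.

    `ground_truths` is a list of summaries (one per user) for the same
    clip; `similarity` is any pairwise measure in [0, 1], such as the
    f1 measure. Returns (mean, standard deviation) of the scores.
    """
    scores = [similarity(auto_summary, gt) for gt in ground_truths]
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread
```

For example, with a simple set-overlap similarity, `evaluate_summary([1, 2], [[1, 2], [1, 3]], ...)` yields one perfect score and one partial score, and returns their mean and spread.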
If we assume that the ground truth is indeed the only truth, and that therefore frames from Class A should belong to a good summary and frames from Class B should not, then the task of an automatic video summarizer is to retrieve as many Class A frames as possible and as few Class B frames as possible. An automatic summarizer divides the frames of the video into the following four categories:

Assigned correctly (AC): number of Class A frames that were picked by the summarizer.
Assigned incorrectly (AI): number of Class B frames that were picked by the summarizer.
Unassigned incorrectly (UI): number of Class A frames that were left out, so UI = A - AC.
Unassigned correctly (UC): number of Class B frames that were left out.

We can define the following measures (with A and B the numbers of frames in Class A and Class B):

Accuracy = percentage of correct decisions = (AC + UC) / (A + B)
Error = percentage of incorrect decisions = (AI + UI) / (A + B)
Precision = p = AC / (AC + AI)
Recall = r = AC / (AC + UI)
f1 measure = 2pr / (p + r)

We will mainly use the f1 measure. It equals 1 for a perfect summary and 0 for a summary containing no Class A frames.

Matching between ground truth and automatic summary

Note that an automatic summary can be 100% correct even if none of its frame indices exactly match the ground truth summary. The reason is that within a static shot frames are almost identical, so one frame may serve the summary as well as the next. Therefore, for each frame of the automatic summary we must determine whether there exists a close enough ground truth frame that matches it and has not been matched yet. This is a typical assignment problem, which we solve with the classic Hungarian algorithm. A frame of the automatic summary that finds no close match among the ground truth frames must be a Class B frame and is therefore counted in AI.
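The matching step and the f1 computation above can be sketched as follows. This is a minimal illustration under stated assumptions, not the toolbox's implementation: frames are compared by index distance with a hypothetical tolerance `tol` (standing in for "visually close enough"; a real implementation would compare frame content, e.g. color feature vectors), and the exhaustive search merely plays the role of the Hungarian algorithm, which solves the same assignment problem efficiently for larger summaries.

```python
def max_matches(auto_frames, truth_frames, tol):
    """Maximum number of one-to-one matches between automatic-summary
    frames and ground-truth frames, where a pair matches when the two
    frame indices differ by at most `tol` frames.

    Exhaustive recursion over possible assignments; the Hungarian
    algorithm gives the same optimum in polynomial time.
    """
    if not auto_frames:
        return 0
    first, rest = auto_frames[0], auto_frames[1:]
    best = max_matches(rest, truth_frames, tol)       # leave `first` unmatched
    for i, t in enumerate(truth_frames):
        if abs(first - t) <= tol:                     # close enough to match
            remaining = truth_frames[:i] + truth_frames[i + 1:]
            best = max(best, 1 + max_matches(rest, remaining, tol))
    return best

def f1_measure(auto_frames, truth_frames, tol=5):
    """f1 = 2pr / (p + r), with p = AC/(AC+AI) and r = AC/(AC+UI)."""
    ac = max_matches(auto_frames, truth_frames, tol)  # assigned correctly
    if ac == 0:
        return 0.0
    ai = len(auto_frames) - ac   # Class B frames picked by the summarizer
    ui = len(truth_frames) - ac  # Class A frames the summarizer missed
    p = ac / (ac + ai)
    r = ac / (ac + ui)
    return 2 * p * r / (p + r)
```

For example, `f1_measure([10, 55, 120], [12, 58, 200])` matches two of the three frames on each side, giving p = r = 2/3 and f1 = 2/3.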
For problems or questions regarding this web site, contact Ayesh Mahajan.