“Content-based video analysis, indexing and representation using multimodal information”
by Ying Li and C.-C. Jay Kuo
May 2003
In this dissertation, research is performed on video content analysis to achieve efficient video indexing, representation and retrieval. Multiple media cues, including various types of audio and visual information, are employed toward this goal. Specifically, the following three major system modules are developed in this work.
In the first module, video segmentation is performed to partition a video sequence into a series of shots in both the raw and compressed data domains. A novel commercial break detection scheme is also developed that integrates audio and visual cues, which helps to remove non-story data and facilitates subsequent content analysis.
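The abstract does not spell out the boundary metric used; as a minimal raw-domain illustration, the sketch below flags a hard cut whenever the color-histogram distance between consecutive frames exceeds a threshold. The function name, bin counts, and threshold value are assumptions for illustration, not the thesis implementation.

```python
import cv2
import numpy as np

def detect_shot_boundaries(video_path, threshold=0.4):
    """Flag frames whose color-histogram distance from the previous
    frame exceeds `threshold`, a common cue for a hard cut."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # 8x8x8-bin color histogram, L1-normalized
        hist = cv2.calcHist([frame], [0, 1, 2], None,
                            [8, 8, 8], [0, 256] * 3).flatten()
        hist /= hist.sum() + 1e-9
        if prev_hist is not None:
            # L1 distance between consecutive frame histograms
            if np.abs(hist - prev_hist).sum() > threshold:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```

Compressed-domain variants typically replace the per-pixel histogram with features read directly from the bitstream (e.g., DC coefficients or motion vectors), trading some accuracy for speed.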
In the second module, a sophisticated movie content analysis system is proposed that comprises three sub-schemes: movie scene extraction, movie event extraction, and speaker identification. Specifically, the purpose of detecting scenes and events is to reveal the underlying video semantics by extracting higher-level story units, while the identified speaker information can be used for content indexing and retrieval. Three types of movie events are considered in this work: two-speaker dialogs, multiple-speaker dialogs, and hybrid events. Moreover, to recognize movie cast members of interest, an adaptive speaker identification system is investigated in which each speaker's acoustic model is updated on the fly by adapting it to newly arriving data. Both audio and visual sources are exploited in the identification process: the audio source is analyzed to recognize speakers using a likelihood-based approach, while the visual source is examined to find talking faces using face detection/recognition and mouth tracking techniques.
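As a rough illustration of likelihood-based identification with on-the-fly adaptation, the sketch below models each speaker with a Gaussian mixture over acoustic features (e.g., MFCCs), picks the speaker whose model scores the test utterance highest, and re-fits the winning model on the new data. The class name, diagonal-covariance GMMs, and the warm-start re-fit are illustrative assumptions; a MAP-style update would be a more typical adaptation rule.

```python
from sklearn.mixture import GaussianMixture

class AdaptiveSpeakerID:
    """Maximum-likelihood speaker identification with simple
    on-the-fly model adaptation (illustrative sketch, not the
    thesis implementation)."""

    def __init__(self, n_components=16):
        self.n_components = n_components
        self.models = {}  # speaker name -> GaussianMixture

    def enroll(self, name, features):
        # features: (n_frames, n_dims) acoustic vectors, e.g. MFCCs
        gmm = GaussianMixture(self.n_components,
                              covariance_type="diag", warm_start=True)
        gmm.fit(features)
        self.models[name] = gmm

    def identify(self, features):
        # Choose the speaker whose model yields the highest average
        # log-likelihood over the utterance.
        scores = {n: m.score(features) for n, m in self.models.items()}
        return max(scores, key=scores.get)

    def adapt(self, name, features):
        # Crude adaptation: warm-started re-fit on the newly arriving
        # data; a principled MAP update would weight in the prior model.
        self.models[name].fit(features)
```

In a full pipeline, the visual channel (talking-face detection and mouth tracking) would gate which audio segments are attributed to an on-screen speaker before scoring.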
In the third module, a video abstraction system is proposed that comprises two schemes: video summarization and video skimming. In particular, a scalable video summarization scheme is developed based on the extracted hierarchical scene structure, which assigns a different number of keyframes to each underlying video unit according to its importance ranking. Moreover, to enable content browsing in the form of a short video clip, we have also developed a video skimming system based on the extracted event structure. Specifically, we select important events to form the final skim by evaluating their feature values and taking the user's preferences into account under a given time constraint.
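A minimal sketch of the scalable keyframe allocation idea: given importance scores for the underlying video units, distribute a keyframe budget proportionally, with largest-remainder rounding and at least one keyframe per unit so no story unit vanishes from the summary. The function name and rounding rule are assumptions; the thesis's actual assignment may differ.

```python
def allocate_keyframes(importance, budget):
    """Split a keyframe budget across video units (scenes/shots)
    in proportion to their importance scores."""
    total = sum(importance.values())
    raw = {u: budget * w / total for u, w in importance.items()}
    # floor each share, but keep at least one keyframe per unit
    alloc = {u: max(1, int(v)) for u, v in raw.items()}
    # hand leftover frames to the largest fractional remainders
    leftover = budget - sum(alloc.values())
    for u in sorted(raw, key=lambda u: raw[u] - int(raw[u]),
                    reverse=True):
        if leftover <= 0:
            break
        alloc[u] += 1
        leftover -= 1
    return alloc

# e.g. allocate_keyframes({"scene1": 0.5, "scene2": 0.3,
#                          "scene3": 0.2}, 10)
# -> {'scene1': 5, 'scene2': 3, 'scene3': 2}
```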
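Similarly, a hedged sketch of skim generation under a time constraint: each event carries a score that, following the abstract, would combine its feature values and the user's preferences. Events are greedily chosen by score per second until the duration budget is exhausted, then replayed in story order. The greedy rule and the dict fields are illustrative; an exact knapsack-style optimization would be an equally plausible reading.

```python
def build_skim(events, max_duration):
    """Select events for a short skim under a duration budget.

    Each event is a dict with 'start', 'duration', and 'score'
    fields (hypothetical structure). Greedy by score density."""
    chosen, used = [], 0.0
    for ev in sorted(events,
                     key=lambda e: e["score"] / e["duration"],
                     reverse=True):
        if used + ev["duration"] <= max_duration:
            chosen.append(ev)
            used += ev["duration"]
    # present the selected events in their original story order
    return sorted(chosen, key=lambda e: e["start"])
```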