“Advanced Techniques for Human Action Classification and Text Localization”
by Harshad Kadu
December 2015
This thesis addresses two research topics: 1) automatic human action classification with mocap data and 2) text localization in natural scene images. The common theme of the two topics is the application of pattern recognition techniques to multimedia information processing, as detailed below.
Automatic classification of human actions in motion capture (mocap) data has many commercial, biomechanical and medical applications and is the principal focus of the thesis. First, we propose a multi-resolution string representation scheme based on tree-structured vector quantization (TSVQ) to transform the time series of human poses into codeword sequences. Then, we take the temporal variations of human poses into account via codeword sequence matching. Furthermore, we develop a family of pose-histogram-based classifiers to examine the spatial distribution of human poses. We analyze the performance of the temporal and spatial classifiers separately. To achieve a higher classification rate, we merge their decisions and soft scores using novel fusion methods. The proposed fusion solutions are tested on a wide variety of sequences from the CMU mocap database using 5-fold cross-validation, and a correct classification rate of 99.6% is achieved.
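The TSVQ string representation and the pose histogram described above can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes a binary tree grown by plain 2-means splits, poses given as fixed-length feature vectors, and hypothetical helper names (`two_means`, `build_tsvq`, `encode`, `pose_histogram`). Each pose is mapped to the bit path it follows down the tree, so a prefix of the path serves as a coarser-resolution codeword.

```python
import numpy as np

def two_means(X, iters=20, seed=0):
    """Split the rows of X into two clusters with plain Lloyd iterations."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=2, replace=False)].astype(float)
    for _ in range(iters):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers

def build_tsvq(X, depth):
    """Grow a binary TSVQ tree; each node stores the centroids of its two children."""
    X = np.asarray(X, dtype=float)
    if depth == 0 or len(X) < 2:
        return None  # leaf: no further refinement
    centers = two_means(X)
    labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
    children = [build_tsvq(X[labels == k], depth - 1) for k in range(2)]
    return {"centers": centers, "children": children}

def encode(tree, pose):
    """Return the bit path of a pose; the prefix bits[:r] is its resolution-r codeword."""
    pose = np.asarray(pose, dtype=float)
    bits, node = "", tree
    while node is not None:
        k = int(np.linalg.norm(node["centers"] - pose, axis=1).argmin())
        bits += str(k)
        node = node["children"][k]
    return bits

def pose_histogram(codes, depth):
    """Normalized histogram of codewords at a given resolution (ignores frame order)."""
    # Paths that stopped early at a leaf are right-padded with "0" (an assumption of this sketch).
    idx = [int(c[:depth].ljust(depth, "0"), 2) for c in codes]
    hist = np.bincount(idx, minlength=2 ** depth).astype(float)
    return hist / hist.sum()
```

A motion sequence then becomes the list `[encode(tree, frame) for frame in frames]`, which a temporal classifier could compare by sequence matching, while `pose_histogram` discards the ordering and keeps only the spatial distribution of poses.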
Searching for text regions in natural images is a challenging task for many computer vision applications. In the second part of our research, we propose a novel text localization scheme based on a multi-stage incremental region classification technique. The stable extremal region detector exploits characteristic properties of text to discover regions with possible textual content across different channels. Coarse decision stump classifiers designed on geometric features, together with context-based text grouping stages, efficiently remove false positives and outline the regions of interest (ROIs). An ensemble of trained decision tree classifiers categorizes the remaining regions as text or non-text using gradient profile features. Finally, the ROIs from different views and channels are fused to produce a consolidated list of text regions. According to the experimental results, the proposed technique ranks among the top algorithms reported in the literature.
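The coarse filtering and final fusion stages can be sketched as follows. This is only an illustration under stated assumptions: the `Region` type, the geometric features (area and aspect ratio), all thresholds, and the overlap-based fusion rule are hypothetical and are not taken from the thesis. The region detector, the context-based grouping, and the decision tree ensemble on gradient profile features are omitted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    """Axis-aligned bounding box of a candidate region (hypothetical type)."""
    x: int
    y: int
    w: int
    h: int

def passes_stumps(r, min_area=30, min_aspect=0.1, max_aspect=10.0):
    """Cascade of one-feature threshold tests; all thresholds are illustrative."""
    if r.w <= 0 or r.h <= 0:
        return False
    if r.w * r.h < min_area:          # stump 1: discard tiny specks
        return False
    aspect = r.w / r.h                # stump 2: discard extreme slivers
    return min_aspect <= aspect <= max_aspect

def iou(a, b):
    """Intersection over union of two boxes."""
    ix = max(0, min(a.x + a.w, b.x + b.w) - max(a.x, b.x))
    iy = max(0, min(a.y + a.h, b.y + b.h) - max(a.y, b.y))
    inter = ix * iy
    union = a.w * a.h + b.w * b.h - inter
    return inter / union if union else 0.0

def fuse(channels, thresh=0.5):
    """Merge per-channel candidates, skipping boxes that duplicate one already kept."""
    kept = []
    for regions in channels:
        for r in regions:
            if passes_stumps(r) and all(iou(r, k) < thresh for k in kept):
                kept.append(r)
    return kept
```

Cheap threshold tests run first so that the more expensive classifiers only see the candidates that survive, which is the usual motivation for a coarse-to-fine cascade of this kind.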