The USC Andrew and Erna Viterbi School of Engineering USC Signal and Image Processing Institute USC Ming Hsieh Department of Electrical and Computer Engineering University of Southern California

Technical Report USC-SIPI-441

“Large-Scale Scene Classification Using Machine Learning Techniques”

by CHEN CHEN

December 2016

Large-scale scene understanding is a fundamental problem in computer vision, with applications in robotic navigation and in image/video indexing, archiving, and retrieval. In this research, we address it using advanced machine learning techniques. Three topics are investigated: 1) the integration of multiple cues with data grouping and decision stacking for indoor/outdoor scene classification, 2) the use of reliable semantic segmentations for multiple outdoor scene classification, and 3) the use of the scene anchor vector in analyzing and resolving scene class confusions. They are elaborated below.

For the first topic, we exploit the diversity of visual data and the complementary strengths of different features to combine multiple visual cues for robust indoor/outdoor scene classification, and we propose an Expert Decision Fusion (EDF) solution for this task. EDF rests on two key ideas: data grouping and decision stacking. With data grouping, we partition the entire data space into multiple disjoint sub-spaces so that a more accurate prediction model can be trained in each sub-space. After data grouping, the EDF system integrates soft decisions from multiple classifiers (called experts) through stacking so that the experts can compensate for each other's weaknesses. The proposed EDF system offers more accurate and robust classification performance because it handles data diversity effectively and benefits from the data abundance of large-scale datasets. The advantages of data grouping and decision stacking are explained and demonstrated in detail in this dissertation.
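The two EDF ideas can be illustrated with a minimal sketch. It is not the report's implementation: the grouping rule (nearest centroid on a brightness value), the two toy experts, the logistic-regression stacker, and all names and data below are assumptions chosen only to show how per-group stacking of soft decisions fits together.

```python
import math

def nearest_group(value, centroids):
    """Data grouping (assumed rule): index of the closest 1-D centroid."""
    return min(range(len(centroids)), key=lambda g: abs(value - centroids[g]))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stacker(soft, labels, epochs=500, lr=0.5):
    """Decision stacking (assumed meta-learner): logistic regression
    trained by stochastic gradient descent on the experts' soft decisions."""
    w, b = [0.0] * len(soft[0]), 0.0
    for _ in range(epochs):
        for s, y in zip(soft, labels):
            err = sigmoid(sum(wi * si for wi, si in zip(w, s)) + b) - y
            w = [wi - lr * err * si for wi, si in zip(w, s)]
            b -= lr * err
    return w, b

def train_edf(samples, labels, experts, centroids):
    """Train one stacker per data group."""
    groups = {}
    for x, y in zip(samples, labels):
        soft_list, label_list = groups.setdefault(
            nearest_group(x[2], centroids), ([], []))
        soft_list.append([expert(x) for expert in experts])
        label_list.append(y)
    return {g: fit_stacker(s, l) for g, (s, l) in groups.items()}

def predict_edf(x, experts, centroids, stackers):
    """Soft outdoor score from the stacker of the group that x falls in."""
    w, b = stackers[nearest_group(x[2], centroids)]
    soft = [expert(x) for expert in experts]
    return sigmoid(sum(wi * si for wi, si in zip(w, soft)) + b)

# Hypothetical experts: each returns a soft "outdoor" score in [0, 1].
def color_expert(x):
    return x[0]

def texture_expert(x):
    return x[1]

experts = [color_expert, texture_expert]
centroids = [0.2, 0.8]  # hypothetical "dark" and "bright" groups

# Toy samples: (color score, texture score, brightness); label 1 = outdoor.
# In dark images only the texture expert is informative; in bright images
# only the color expert is.
samples = [(0.5, 0.1, 0.1), (0.5, 0.9, 0.2), (0.6, 0.2, 0.3), (0.4, 0.8, 0.2),
           (0.1, 0.5, 0.9), (0.9, 0.5, 0.8), (0.2, 0.4, 0.7), (0.8, 0.6, 0.8)]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

stackers = train_edf(samples, labels, experts, centroids)
```

In this toy setup each group's stacker learns to rely on the expert that is informative for that group, which is the kind of compensation between experts that the paragraph above describes.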

For the second topic, we propose a Coarse Semantic Segmentation (CSS) approach to scene understanding with the help of semantic content recognition. Unlike traditional scene understanding algorithms, which usually rely on basic learning units such as pixels, patches, and super-pixels, CSS obtains reliable image-adaptive learning units (rough segments) in an image. Taking advantage of the reliable segmentation results, CSS uses a robust context-aware labeling scheme to associate relevant semantic visual words with each segment. In this dissertation, we show the significance and efficiency of CSS by applying it to the outdoor scene classification problem, and we demonstrate its superior performance with extensive experimental results.

For the last topic, we introduce the Scene Anchor Vector (SAV) concept to explain the source of scene class confusions. An SAV points to a cluster of images. If the angle between two anchor vectors is small, the corresponding image clusters overlap, leading to a set of confusing classes. To overcome this, we propose to merge the images associated with confusing anchor vectors into a confusion set and then split the set in an unsupervised fashion into multiple subsets. This is called the “automatic subset clustering (ASC)” process. Each of these subsets contains scene images of strong visual similarity. After the ASC process, we train a random forest (RF) classifier for each confusion subset to allow better scene classification. The ASC/RF scheme can be added on top of any existing scene-classification CNN as a post-processing module with little extra training effort. Extensive experimental results show that, for a given baseline CNN, the ASC/RF scheme can offer a significant performance gain.
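The angle criterion above can be sketched as follows. This is an illustration under stated assumptions rather than the report's procedure: the anchor vectors are taken as given (here, made-up 3-D vectors), the 15-degree threshold is hypothetical, and classes are merged transitively with a small union-find. The subsequent ASC splitting and RF training steps are not shown.

```python
import math

def angle_deg(u, v):
    """Angle in degrees between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return math.degrees(math.acos(cos))

def confusion_sets(anchors, max_angle_deg):
    """Merge classes whose anchor vectors are closer than max_angle_deg.

    anchors: dict mapping class name -> anchor vector (assumed to be given).
    Returns the merged sets that contain more than one class.
    """
    names = list(anchors)
    parent = {n: n for n in names}

    def find(n):
        # Union-find root lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if angle_deg(anchors[a], anchors[b]) < max_angle_deg:
                parent[find(a)] = find(b)

    merged = {}
    for n in names:
        merged.setdefault(find(n), []).append(n)
    return [sorted(s) for s in merged.values() if len(s) > 1]

# Made-up anchor vectors for five hypothetical scene classes.
anchors = {
    "bedroom": (1.0, 0.1, 0.0),
    "hotel_room": (0.9, 0.2, 0.0),
    "beach": (0.0, 0.1, 1.0),
    "coast": (0.1, 0.0, 0.9),
    "forest": (0.0, 1.0, 0.0),
}
sets = confusion_sets(anchors, max_angle_deg=15.0)
```

With these toy vectors, the two indoor classes and the two shoreline classes each form a confusion set, while "forest" stays on its own because its anchor vector is nearly orthogonal to the others.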

To download the report in PDF format click here: USC-SIPI-441.pdf (20.0Mb)