The USC Andrew and Erna Viterbi School of Engineering
USC Signal and Image Processing Institute
USC Ming Hsieh Department of Electrical and Computer Engineering
University of Southern California

Technical Report USC-SIPI-452

“Local-Aware Deep Learning: Methodology and Applications”

by Heming Zhang

Deep learning techniques cascade multiple network layers to map inputs to desired outputs. For this mapping to succeed, the layers must extract useful information from the inputs. Because feature extraction and prediction are performed jointly, we have no direct control over feature extraction. Consequently, some useful information, especially local information, is discarded in the process. In this thesis, we study local-aware deep learning techniques from four aspects:

1. Local-aware network architecture
2. Local-aware proposal generation
3. Local-aware region analysis
4. Local-aware supervision

In Chapter 2, we design a multi-modal attention mechanism for a generative visual dialogue system. A visual dialogue system holds a dialogue between a human and a machine. A generative visual dialogue system takes as inputs an image, the sentence from the current round of dialogue, and the dialogue from past rounds, and generates a response to continue the dialogue. Our proposed local-aware network architecture attends to these multi-modal inputs simultaneously and uses the extracted local information to generate dialogue responses.

In Chapter 3, we propose a proposal network for a fast face detection system on mobile devices. Face detection on mobile devices must deliver high accuracy, fast inference, and a small model size despite the limited computation power and storage space of such devices. Our proposed local-aware proposal generation module detects salient facial parts and uses them as local cues for detecting entire faces. It accelerates inference without adding much to the model size.

In Chapter 4, we extract representative fashion features by analyzing local regions. Many fashion attributes, such as the shape of the collar, the length of the sleeves, and the pattern of the prints, can be found only in local regions. Our proposed local-aware region analysis extracts fashion features from different levels of the deep network, so that the extracted features contain many of the local fashion details that interest people.

In Chapter 5, we develop a fashion outfit compatibility learning method with local graphs. When a fashion outfit is modeled as a graph, a network that learns compatibility only on entire outfit graphs may ignore subtle differences among outfits. Our proposed local-aware supervision comprises the construction of local graphs and a corresponding local loss function. The local graphs are constructed from partial outfits. A network trained with the local loss function on the local graphs is then able to learn subtle differences in compatibility in fashion outfit data.
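
As a rough illustration of how local graphs might be built from partial outfits, the following Python sketch enumerates item subsets of an outfit and treats each subset as a fully connected local graph. The function name `local_graphs`, the `min_size` parameter, and the choice of fully connected subsets are assumptions made for illustration only; the abstract does not specify how the thesis constructs its local graphs or defines its local loss.

```python
from itertools import combinations


def local_graphs(outfit, min_size=2):
    """Enumerate partial outfits and return each as a (nodes, edges) pair.

    Assumption (not from the report): a partial outfit is any proper subset
    of at least `min_size` items, and its local graph connects every pair
    of items in that subset.
    """
    graphs = []
    # Proper subsets only: the full outfit is the global graph, not a local one.
    for size in range(min_size, len(outfit)):
        for nodes in combinations(outfit, size):
            edges = list(combinations(nodes, 2))
            graphs.append((nodes, edges))
    return graphs


# Hypothetical four-item outfit used only to exercise the sketch.
outfit = ["top", "bottom", "shoes", "bag"]
graphs = local_graphs(outfit)
```

Under these assumptions, a compatibility model could be scored on each local graph in addition to the full outfit graph, which is one way that a loss defined on partial outfits could expose differences that the full graph alone hides.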


This report is not currently available in PDF format for downloading. Contact the Signal and Image Processing Institute for information on its availability.