USC Andrew and Erna Viterbi School of Engineering
USC Signal and Image Processing Institute
USC Ming Hsieh Department of Electrical and Computer Engineering
University of Southern California

Technical Report USC-SIPI-451

“Visual Knowledge Transfer With Deep Learning Techniques”

by Junting Zhang

The classical machine learning paradigm rarely exploits the dependencies and relations among different tasks and domains, and in the deep learning era, manually creating a labeled dataset for each task becomes prohibitively expensive. In this dissertation, we aim to develop effective techniques to retain, accumulate, and transfer knowledge gained from past learning experiences to solve new problems in new scenarios. Specifically, we consider four types of knowledge transfer scenarios in computer vision applications: 1) incremental learning, where we transfer knowledge from old tasks to a new task as training data become available gradually over time; 2) domain adaptation, where we obtain knowledge from labeled training data in one domain and then transfer and apply it in another domain; 3) knowledge transfer across applications, to improve the robustness of the target application; and 4) knowledge transfer in the spatiotemporal domain, where we perform pixel-wise tracking of multiple objects in a video sequence given the annotations for the first frame.

Existing incremental learning (IL) approaches tend to produce a model that is biased towards either the old classes or the new classes, unless they have access to exemplars of the old data. To address this issue, we propose a class-incremental learning paradigm called Deep Model Consolidation (DMC), which works well even when the original training data is not available. The idea is to train a model on the new data and then combine the two individual models, each trained on one of the two disjoint sets of classes (old classes and new classes), via a novel dual distillation training objective. The two existing models are consolidated by exploiting publicly available unlabeled auxiliary data, which overcomes the potential difficulties caused by the unavailability of the original training data. Compared to state-of-the-art techniques, DMC demonstrates significantly better performance on the CIFAR-100 image classification and PASCAL VOC 2007 object detection benchmarks in the IL setting.
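For illustration, the consolidation step can be sketched as follows. This is a minimal sketch assuming a PyTorch-style setup; the logit normalization, model names, and data loader are illustrative placeholders rather than the exact objective and code used in the dissertation.

import torch
import torch.nn.functional as F

def dual_distillation_loss(student_logits, old_logits, new_logits):
    # student_logits: [B, n_old + n_new] from the consolidated model being trained.
    # old_logits: [B, n_old] and new_logits: [B, n_new] from the two frozen teachers.
    # Zero-center each teacher's logits (one plausible normalization choice), then
    # regress the student's logits onto the concatenated teacher targets.
    target = torch.cat(
        [old_logits - old_logits.mean(dim=1, keepdim=True),
         new_logits - new_logits.mean(dim=1, keepdim=True)], dim=1)
    return F.mse_loss(student_logits, target.detach())

# Consolidation loop over unlabeled auxiliary images (loader and model names are placeholders):
# for images in auxiliary_loader:
#     with torch.no_grad():
#         old_logits, new_logits = old_model(images), new_model(images)
#     loss = dual_distillation_loss(student_model(images), old_logits, new_logits)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()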
In the second work, we propose a domain adaptation method for urban scene segmentation. We develop a fully convolutional tri-branch network, in which two branches assign pseudo labels to images in the unlabeled target domain while the third branch is trained with supervision on the pseudo-labeled target-domain images. The re-labeling and re-training processes alternate. With this design, the tri-branch network progressively learns target-specific discriminative representations, and the cross-domain capability of the segmenter improves as a result. We evaluate the proposed network in large-scale domain adaptation experiments using both synthetic (GTA) and real (Cityscapes) images, and show that our solution achieves state-of-the-art performance, outperforming previous methods by a significant margin.

Scene text detection is a critical prerequisite for many fascinating applications, and we choose it as an example application to explore the possibility of transferring knowledge across applications. Existing methods detect text either by using only local information or by casting detection as a semantic segmentation problem; they tend to produce a large number of false alarms or fail to separate individual words accurately. In this work, we present a segmentation-aided text detection solution that predicts word-level bounding boxes using an end-to-end trainable deep convolutional neural network. It exploits the holistic view of a segmentation network to generate a text attention map (TAM), and then uses the TAM to refine the convolutional features for the MultiBox detector through a multiplicative gating process, as sketched below. Experiments on the large-scale and challenging COCO-Text dataset demonstrate that the proposed method significantly outperforms state-of-the-art methods.

We also study knowledge transfer in the spatiotemporal domain for video understanding. Semi-supervised video object segmentation is a pixel-wise tracking problem in which the mask annotations given for the first frame must be propagated throughout the full video. We propose to aggregate pixel features into region features via soft superpixel clustering and then build a spatiotemporal graph over the regions extracted from adjacent frames. A graph neural network reasons in this 3D space and refines the features associated with each node via message passing. The segmentation masks are estimated by predicting the node labels for the query frame and re-projecting them back to the pixel space. The proposed system is end-to-end trainable and involves only one forward pass of the network at test time. Our method achieves accuracy comparable to the state of the art on the DAVIS 2017 benchmark with far less computation and memory consumption.
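As a rough illustration of the region pooling and message passing described above, the following sketch (assuming PyTorch; the soft assignment head, adjacency construction, and tensor shapes are illustrative assumptions) shows one way pixel features can be aggregated into region nodes and refined on the graph.

import torch
import torch.nn.functional as F

def pixels_to_regions(pixel_feats, assign_logits):
    # pixel_feats: [B, C, H, W] backbone features; assign_logits: [B, K, H, W] soft
    # assignment of each pixel to one of K regions (soft superpixel clustering).
    B, C, H, W = pixel_feats.shape
    A = F.softmax(assign_logits, dim=1).flatten(2)              # [B, K, H*W]
    X = pixel_feats.flatten(2).transpose(1, 2)                  # [B, H*W, C]
    regions = (A @ X) / (A.sum(dim=2, keepdim=True) + 1e-6)     # [B, K, C] weighted means
    return regions, A

def message_passing(regions, adjacency, update):
    # One round of neighborhood aggregation on the spatiotemporal region graph:
    # each node receives the degree-normalized sum of its neighbors' features and is
    # updated jointly with its own feature; `update` is e.g. nn.Linear(2 * C, C).
    deg = adjacency.sum(dim=-1, keepdim=True).clamp(min=1.0)
    messages = (adjacency @ regions) / deg
    return F.relu(update(torch.cat([regions, messages], dim=-1)))

Re-projecting predicted node labels back to pixels can reuse the same soft assignment A, transposed so that each pixel receives a weighted combination of its regions' labels.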
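The multiplicative gating used in the scene text detection work above admits a similarly compact sketch (again assuming PyTorch; names and shapes are illustrative).

import torch.nn.functional as F

def gate_detector_features(conv_feats, text_attention_map):
    # conv_feats: [B, C, H, W] features feeding the box-prediction layers.
    # text_attention_map: [B, 1, h, w] per-pixel text probability from the segmentation branch.
    # Resize the attention map to the feature resolution and gate the features
    # multiplicatively, suppressing activations in non-text regions.
    tam = F.interpolate(text_attention_map, size=conv_feats.shape[-2:],
                        mode='bilinear', align_corners=False)
    return conv_feats * tam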


This report is not currently available in PDF format for downloading. Contact the Signal and Image Processing Institute for information on its availability.