“A Deep Learning Approach to Online Single and Multiple Object Tracking”
by Weihao Gan
May 2018
Online object tracking is a fundamental computer vision problem. It is widely used in real-world applications such as traffic control and safety monitoring in video surveillance, autonomous vehicles, robotic navigation and medical imaging. It is a very challenging problem due to multiple time-varying attributes in video sequences. One widely adopted online object tracking framework is tracking-by-detection (TBD), where tracking is treated as a detection problem. This strategy exploits the spatial information of the image content. In this research, we investigate two kinds of tracking problems: single object tracking (SOT) and multiple object tracking (MOT). First, we address online single object tracking using both spatial and motion cues with two novel methods. Second, building on the proposed SOT technique, we develop an online multiple object tracking system with advanced model update and matching.
For SOT, we first develop a new method, called the "temporal prediction and spatial refinement (TPSR)" tracker, to integrate spatial and temporal cues effectively. The TPSR tracking system consists of three cascaded modules: pre-processing (PP), temporal prediction (TP) and spatial refinement (SR). Illumination variation and camera shake are two challenging factors in tracking; both are compensated in the PP module. Then, a joint scheme of region-based template matching (TM) and pixel-wise optical flow (OF) is adopted in the TP module, with automatic switching between TM and OF. These two modes work in a complementary manner to handle different foreground and background situations. Finally, to overcome the drift error arising from the TP module, the bounding box location and size are fine-tuned in the SR module using the local spatial information of the new frame. Next, we apply a deep neural network architecture to the online object tracking problem.
The proposed method is called the "Motion-Guided Convolutional Neural Network (MGNet) Tracker". It is built upon the multi-domain convolutional neural network (MDNet) with two innovations: 1) adoption of a motion-guided candidate selection (MCS) scheme based on a dynamic prediction model, and 2) use of an RGB-plus-motion 5-channel input to the convolutional neural network (CNN). For the former, a dynamic motion model is adopted to estimate the probability distribution of each candidate's location, width and height. As a result, the MGNet can generate candidates more accurately and efficiently. For the latter, we add the horizontal and vertical optical flow fields to the original three RGB channels to form a 5-channel input, so that the CNN exploits motion information explicitly rather than implicitly. We compare the performance of the MGNet, the MDNet and several state-of-the-art online object trackers on the OTB and VOT benchmark datasets. Extensive performance evaluation demonstrates that the MGNet captures the temporal motion correlation between consecutive frames more effectively.
Finally, we explore a multiple object tracking (MOT) system based on the CNN single object tracker. The proposed method is called "Online CNN-based Multiple Object Tracking with Enhanced Model Updates and Identity Association". This method treats the MOT problem as an online tracking problem rather than as a global optimization problem. The tracking system has three major components: 1) a system platform built upon multiple CNN single object trackers in the MOT environment; 2) an advanced online update strategy with incremental and refresh update modes; and 3) a confirmation process for identity matching based on multi-level feature representations.
We evaluate the proposed framework on the commonly used multiple object tracking benchmark MOTChallenge, where it ranks first in accuracy, precision, identity switches and fragments among all online MOT methods. Extensive experiments show that the proposed online update strategy is crucial for training an accurate target tracker and for controlling drift error in subsequent frames.
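The two MGNet ideas summarized above, the RGB-plus-motion 5-channel input and motion-guided candidate selection, can be illustrated with a minimal NumPy sketch. It is not the dissertation's implementation. The function names, the constant-velocity motion model, the Gaussian sampling distribution, and all parameter values (`max_flow`, `pos_std`, `scale_std`) are assumptions made for illustration, and the optical flow field is assumed to be computed elsewhere and passed in as an array.

```python
import numpy as np


def make_five_channel_input(rgb, flow, max_flow=20.0):
    """Stack an RGB frame with its optical flow into an (H, W, 5) array.

    rgb:  (H, W, 3) uint8 image.
    flow: (H, W, 2) horizontal and vertical flow in pixels (computed elsewhere).
    max_flow is a hypothetical clipping bound, not a value from the source.
    """
    rgb = np.asarray(rgb, dtype=np.float32) / 255.0
    flow = np.asarray(flow, dtype=np.float32)
    if rgb.ndim != 3 or rgb.shape[2] != 3:
        raise ValueError("rgb must have shape (H, W, 3)")
    if flow.shape != rgb.shape[:2] + (2,):
        raise ValueError("flow must have shape (H, W, 2)")
    # Scale flow to [-1, 1] so it is comparable in range to the RGB channels.
    flow = np.clip(flow, -max_flow, max_flow) / max_flow
    return np.concatenate([rgb, flow], axis=2)


def sample_candidates(prev_box, velocity, n=256,
                      pos_std=0.1, scale_std=0.05, rng=None):
    """Draw n candidate boxes (cx, cy, w, h) around a predicted location.

    Assumption: a constant-velocity prediction with Gaussian noise stands in
    for the dynamic motion model mentioned in the abstract.
    """
    rng = np.random.default_rng() if rng is None else rng
    cx, cy, w, h = prev_box
    vx, vy = velocity
    size = np.sqrt(w * h)  # position noise grows with target size
    cxs = rng.normal(cx + vx, pos_std * size, n)
    cys = rng.normal(cy + vy, pos_std * size, n)
    # Log-normal scale noise keeps widths and heights strictly positive.
    ws = w * np.exp(rng.normal(0.0, scale_std, n))
    hs = h * np.exp(rng.normal(0.0, scale_std, n))
    return np.stack([cxs, cys, ws, hs], axis=1)
```

In this sketch, each candidate box would be cropped from the 5-channel array and scored by the network; centering the samples on the predicted position rather than the previous position is what makes the selection motion-guided.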