PatchTrack: Multiple Object Tracking Using Frame Patches

Xiaotong Chen
Computer Science
UC, Santa Barbara
[email protected]
This work was done during an internship at Appen
   Seyed Mehdi Iranmanesh
Amazon
[email protected]
This work was done while at Appen
   Kuo-Chin Lien
Appen
[email protected]
Abstract

Object motion and object appearance are commonly used sources of information in multiple object tracking (MOT) applications, either for associating detections across frames in tracking-by-detection methods or for direct track prediction in joint-detection-and-tracking methods. However, not only are these two types of information often considered separately, but they also do not directly help optimize the usage of visual information from the current frame of interest. In this paper, we present PatchTrack, a Transformer-based joint-detection-and-tracking system that predicts tracks using patches of the current frame of interest. We use the Kalman filter to predict the locations of existing tracks in the current frame from the previous frame. Patches cropped from the predicted bounding boxes are sent to the Transformer decoder to infer new tracks. By utilizing both the object motion and object appearance information encoded in patches, the proposed method pays more attention to where new tracks are more likely to occur. We show the effectiveness of PatchTrack on recent MOT benchmarks, including MOT16 (MOTA 73.71%, IDF1 65.77%) and MOT17 (MOTA 73.59%, IDF1 65.23%). The results are published at https://motchallenge.net/method/MOT=4725&chl=10.

(a) Tracking using track queries (output embeddings of the previous frame).
(b) Detection pre-trained to detect image patches.
(c) Tracking using patch queries and frame patches.
Figure 1: Inspiration for PatchTrack. MOT methods [35, 24] use output embeddings of the previous frame as track queries (colors represent tracking IDs) to propagate existing objects, and object queries to detect new objects entering the camera view (1(a)). The detection model [6] is pre-trained to locate image patches by adding their features to object queries (1(b)). The proposed system uses both track queries and features of frame patches predicted by a motion model to predict the locations of the corresponding objects (1(c)).

1 Introduction

Multiple object tracking (MOT) concerns identifying objects of interest and tracking their moving trajectories in video sequences. Intuitively, successful MOT algorithms need to be able to handle subtle appearance differences between multiple tracked objects and resolve the ambiguity via other cues, such as motion, when the targets are visually indistinguishable.

With the powerful appearance encoding capability of CNNs, the tracking-by-detection paradigm has dominated MOT methods over the past decade [5, 51, 44]. Highly accurate CNN-based object detection [30, 31, 4] is first performed in all frames independently, and then the detected objects are associated across frames to establish tracks with consistent object IDs. In the association step, the locations of existing tracks in the following frame may be predicted from assumptions (constant velocity, constant acceleration, etc.) or other motion models [51, 34, 43, 44] and then associated with detections based on metrics such as intersection-over-union (IoU).

Joint-detection-and-tracking methods [53, 35, 45] have recently demonstrated superior accuracy. The idea is to perform object detection and tracking simultaneously so that each task benefits from information shared by the other. This is particularly intriguing in Transformer-based architectures, where output feature embeddings of previous frames are used as 'track queries', along with 'object queries', for the Transformer decoder, which predicts the corresponding tracks as well as newly discovered objects in the current frame (Figure 1(a)). Albeit achieving state-of-the-art MOT results, we argue that these architectures rely too heavily on appearance information from previous frames. As the information encoded in track queries is strictly limited to previous frames, the Transformer model needs to infer both the object offset and the object appearance in the current frame.

To resolve the above problem, we take inspiration from UP-DETR [6], an object detection model that is pre-trained to detect image patches (Figure 1(b)) using patch features, and propose an MOT system that uses frame patches from the current frame of interest. We first use a motion model to predict the new locations of existing tracks in the current frame from the previous frame, and crop the current frame into patches based on these predictions. These patches, carrying implicit prior knowledge of object motion and explicit information about object appearance in the current frame, are sent to the decoder to predict the new locations of existing tracks in the current frame.

More specifically, we present PatchTrack (Figure 1(c)), a Transformer-based joint-object-detection-and-tracking system that predicts tracks in the current frame of interest from its patches. We use the Kalman filter [43] to obtain track candidates in the current frame from existing tracks in the previous frame, and crop the current frame using the bounding boxes of these candidates to get patches. Both the current frame and these patches are sent to our convolutional neural network (CNN) [13] backbone, which outputs the frame feature and the patch queries respectively. Each track query, taken from the output embeddings produced when processing the previous frame, is added to the patch query with the same tracking ID to form the corresponding patch-track query. These patch-track queries are sent to the decoder along with object queries, where the former are used to predict the new locations of existing tracks and the latter are used to detect new objects in the current frame.

We evaluate PatchTrack on MOT benchmarks and achieve competitive results on MOT16 (MOTA 73.71%, IDF1 65.77%) and MOT17 (MOTA 73.59%, IDF1 65.23%) test sets. To the best of our knowledge, our method is the first that uses patches of the current frame of interest to infer both object motion and appearance information simultaneously. We hope it could provide a new perspective for designing MOT systems.

In summary, our contributions are:

  • A Transformer-based MOT system, namely PatchTrack, which jointly performs object detection and tracking.

  • A novel way of optimizing the usage of visual information by utilizing patches from the current frame of interest.

  • Introduction of patch-track queries that incorporate both knowledge of the object motion and object appearance in the current frame of interest to facilitate tracking.

2 Related Work

2.1 Object detection and tracking

Object detection concerns locating and/or classifying objects of interest in a single image. As the prerequisite to object tracking, the two are closely connected. Many popular object detection methods generate detections from hypotheses of object locations, including region proposals [12, 11, 31, 4] and anchors/object centers [30, 22, 54]. On the other hand, an increasing number of object tracking systems utilize the Transformer [39], which has previously shown success in object detection [5, 25, 56, 6]. Transformer-based object detection methods encode CNN [13] features of images and decode learned object queries to obtain detections. Aside from architecture adjustments [25, 56] to the original DETR [5], we also see modifications to object queries [6] using image patches to facilitate detection. Inspired by the usage of region proposals and image patches, our proposed method uses frame patches, which can be considered our initial guess of track locations and appearance.

2.2 Tracking-by-detection

One major paradigm in MOT is tracking-by-detection, where the MOT systems [5, 51, 44] first obtain detections for each frame and then associate them across frames to form tracks. Since object detection is a standalone step in the tracking process, one benefit of tracking-by-detection methods is the flexibility to pair different object detection models [31, 30, 5] with different association strategies, thereby benefiting directly from advances in object detection. On the other hand, the object detection step omits information across frames, as each frame is processed separately by the detector.

Object motion and appearance may only be considered as part of the detection association strategy in these methods [51, 34]. For object motion, the Kalman filter [43] is one of the most popular algorithms used to propagate detections from the previous frame and predict their locations in the following frame. Combined with the Hungarian algorithm [16] and intersection-over-union (IoU) metrics, it has proven to be an effective tracking mechanism [3]. Object appearance information such as Re-ID features [44, 28, 51] is also commonly used as a similarity measure.
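To make this association step concrete, the following is a minimal sketch of IoU-based matching with the Hungarian algorithm in the spirit of SORT [3]; the function names and the threshold value are illustrative and not taken from any of the cited systems.

```python
# A minimal sketch of IoU-based detection association with the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted_tracks, detections, iou_threshold=0.5):
    """Match motion-model predictions to detections by maximizing total IoU."""
    cost = np.zeros((len(predicted_tracks), len(detections)))
    for i, t in enumerate(predicted_tracks):
        for j, d in enumerate(detections):
            cost[i, j] = -iou(t, d)          # negate: the Hungarian solver minimizes cost
    rows, cols = linear_sum_assignment(cost)
    matches = [(i, j) for i, j in zip(rows, cols) if -cost[i, j] >= iou_threshold]
    return matches                            # unmatched detections would start new tracks
```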

2.3 Joint-detection-and-tracking

The other popular paradigm in MOT is joint-detection-and-tracking, where object detection and object tracking are performed simultaneously [53, 35]. One advantage of joint-detection-and-tracking methods is access to information across frames. For instance, features of multiple frames can be used at once [53, 35, 45] to facilitate detection and/or tracking. For Transformer-based joint-detection-and-tracking methods, both the encoder and the decoder may take additional information from previous frames to infer predictions for the current frame of interest [35, 24, 50]. Specifically, recent works have introduced track queries [35, 24], which come from the output embeddings produced when processing previous frames. Depending on the design, the track queries may be decoded to bounding boxes separately from the object queries [35] and matched together to predict new tracks, or processed together to form new tracks directly [24].

Figure 2: PatchTrack. We first use the Kalman filter [43] to predict track candidates in frame $f_k$ from the tracks in frame $f_{k-1}$. Both frames are sent to the CNN backbone, which produces frame features for the Transformer encoder. We crop $f_k$ into patches using the bounding boxes of the track candidates and send them to the CNN backbone, followed by a fully connected layer (FC) and global average pooling (GAP), to get patch queries that align with the track queries. Patch queries are added to track queries to form patch-track queries, which are then sent to the Transformer decoder along with object queries. The patch-track queries are decoded to output embeddings that refine the locations of the track candidates, and the object queries are decoded to output embeddings that detect new objects. Output embeddings that correspond to tracks in $f_k$ become the track queries for processing $f_{k+1}$.

3 Method

In this section, we describe the architecture of PatchTrack (Section 3.1), how object tracking is initialized (Section 3.2), how existing tracks are propagated to form new track candidates (Section 3.3), and how frame patches are generated to facilitate object tracking (Section 3.4).

3.1 Architecture

PatchTrack is a Transformer-based joint-detection-and-tracking system. The Transformer encoder takes in the CNN features of a consecutive frame pair. The Transformer decoder takes queries as input and outputs bounding boxes. PatchTrack deals with four types of queries: object queries, track queries, patch queries, and patch-track queries. Depending on the source of the queries, the predicted bounding boxes may correspond either to tracks associated with existing tracking IDs or to detections that need to be assigned new tracking IDs.

3.2 Object tracking initialization

Object tracking for the first frame $f_1$ is equivalent to object detection, where each predicted detection can be arbitrarily assigned a unique tracking ID to form a track. Frame $f_1$ is sent to the CNN backbone, which outputs the corresponding frame feature. This feature is stacked with itself [35] and sent to the Transformer encoder. Since there are no existing tracks to form non-object queries, the Transformer decoder only takes object queries as input and produces embeddings. The output embeddings that result in non-background bounding boxes are the predicted detections in $f_1$, each of which is assigned a unique tracking ID to form a track. These embeddings are also used as the track queries for the next frame.

3.3 Track propagation

For frame $f_k$ ($k>1$), there exists $f_{k-1}$ with a set of tracks $T_{k-1}$. We can propagate these tracks using a motion model and infer tracks in $f_k$ (Algorithm 1).

Here we use the Kalman filter [43] as our motion model to predict a set of track candidates for $f_k$, denoted $\widehat{T}_{k}$. We call them track candidates because there are several problems with using them directly as tracks in $f_k$. First, since the tracks in $\widehat{T}_{k}$ are mapped one-to-one to the ones in $T_{k-1}$, they only include objects that have already appeared in $f_{k-1}$. Second, although the Kalman filter and other motion models have shown effectiveness in many cases [3, 40, 51], their predicted bounding boxes are not accurate enough at locating objects. This is why motion models are typically used only to process existing tracks, with IoU introduced to match the processed tracks against new detections to form new tracks. In the joint-object-detection-and-tracking paradigm, our architecture is instead designed to refine these track candidates into more accurate tracks.

Input: Tracks $T_{k-1}$ in frame $f_{k-1}$; motion model $\texttt{M}$
Output: Track candidates $\widehat{T}_{k}$ for frame $f_{k}$
Initialization: $\widehat{T}_{k}\leftarrow\emptyset$
for $t\in T_{k-1}$ do
      $\widehat{T}_{k}\leftarrow\widehat{T}_{k}\cup\{\texttt{M}(t)\}$
end for
Algorithm 1: Pseudo-code for object propagation
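The following is a minimal sketch of Algorithm 1 with a constant-velocity prediction standing in for the full Kalman filter; the Track structure and its field names are illustrative assumptions, not the exact implementation.

```python
# A minimal sketch of track propagation: each track keeps a box and a velocity
# estimate, and the prediction step advances the box by one frame.
from dataclasses import dataclass, field

@dataclass
class Track:
    track_id: int
    box: list                                   # (cx, cy, w, h) center/size parameterization
    velocity: list = field(default_factory=lambda: [0.0, 0.0, 0.0, 0.0])

def propagate(tracks_prev):
    """Predict track candidates for frame k from tracks in frame k-1 (Algorithm 1)."""
    candidates = []
    for t in tracks_prev:
        predicted_box = [b + v for b, v in zip(t.box, t.velocity)]   # constant-velocity step
        candidates.append(Track(t.track_id, predicted_box, t.velocity))
    return candidates
```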

3.4 Patch generation and object tracking

To tackle the above problems, we take inspiration from UP-DETR [6], whose Transformer decoder is pre-trained to detect the locations of random image patches using their corresponding CNN features. Our proposed PatchTrack takes patches of frame $f_k$ as additional visual information, beyond the entire $f_k$, to perform object tracking. Specifically, for each track candidate $\widehat{t}\in\widehat{T}_{k}$, we crop the frame using its bounding box and send the resulting patch to the CNN backbone to get the corresponding patch feature. We use a fully-connected (FC) layer followed by global average pooling (GAP) to process all patch features into patch queries that align with the track queries (Figure 2). Each patch query is added to the track query with the same tracking ID to form a patch-track query. The patch-track queries are sent to the Transformer decoder along with the initial object queries, and both are processed jointly. The output embedding decoded from each patch-track query may correspond either to the refined location of the corresponding track candidate, or to the background if the object has left $f_k$. On the other hand, the embeddings decoded from object queries that result in non-background detections locate new objects entering $f_k$, which are assigned new tracking IDs to form new tracks. All embeddings that contribute to tracks in $f_k$ form the track queries for $f_{k+1}$ (Figure 2).
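Below is a hedged PyTorch sketch of this patch-query pipeline (crop, shared backbone, FC followed by GAP, and addition to the matching track query); the module names, feature dimensions, and the stand-in backbone are assumptions for illustration rather than the exact implementation.

```python
import torch
import torch.nn as nn

class PatchQueryEncoder(nn.Module):
    """Turns cropped frame patches into patch queries (Figure 2): backbone -> FC -> GAP."""
    def __init__(self, backbone=None, backbone_channels=64, hidden_dim=256):
        super().__init__()
        # Stand-in backbone; in the paper this would be the shared ResNet-50 trunk.
        self.backbone = backbone or nn.Conv2d(3, backbone_channels, 3, stride=2, padding=1)
        self.fc = nn.Linear(backbone_channels, hidden_dim)   # FC applied on the channel dimension
        self.gap = nn.AdaptiveAvgPool2d(1)                   # GAP over the spatial dimensions

    def forward(self, frame, candidate_boxes):
        # frame: (3, H, W) tensor; candidate_boxes: list of (x1, y1, x2, y2) pixel coordinates
        queries = []
        for x1, y1, x2, y2 in candidate_boxes:
            patch = frame[:, y1:y2, x1:x2].unsqueeze(0)       # crop f_k with the candidate box
            feat = self.backbone(patch)                       # (1, C, h, w) patch feature
            feat = self.fc(feat.permute(0, 2, 3, 1))          # project channels: (1, h, w, D)
            queries.append(self.gap(feat.permute(0, 3, 1, 2)).flatten(1))
        return torch.cat(queries, dim=0)                      # (num_candidates, hidden_dim)

# Patch-track queries are the element-wise sum of patch queries and the track
# queries carrying the same tracking IDs:
#   patch_track_queries = patch_queries + track_queries
```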

3.5 Track re-birth

To obtain the track queries for frame $f_{k+1}$ from the track queries for $f_k$, embeddings corresponding to new detections are added and track queries corresponding to the background class are removed (Figure 2). A problem with this mechanism is that it is not robust for long-range tracking: if an object is not successfully detected, it can only be assigned a new tracking ID when it is detected again, which causes fragmented trajectories. To tackle this problem, we adopt the track re-identification strategy from TrackFormer [24] and store the otherwise-removed patch-track queries in an inactive query set. Queries in this set are included in the list of patch-track queries sent to the decoder for at most $P$ consecutive frames. If such a query is decoded to a non-background bounding box during this period, it is re-activated with its original tracking ID; otherwise it is removed.
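A minimal sketch of this inactive-query bookkeeping is shown below; the class and method names are ours, and only the patience-$P$ behavior described above is modeled.

```python
# A minimal sketch of track re-birth: queries of tracks that fall to background are
# kept "inactive" for up to P frames and re-activated if they decode to a
# non-background box again; otherwise they expire.
class InactiveQueryPool:
    def __init__(self, patience=30):
        self.patience = patience
        self.pool = {}                        # track_id -> (query, frames_inactive)

    def deactivate(self, track_id, query):
        self.pool[track_id] = (query, 0)

    def step(self, reactivated_ids):
        """Call once per frame with the IDs that decoded to non-background boxes."""
        revived, expired = {}, []
        for tid, (query, age) in self.pool.items():
            if tid in reactivated_ids:
                revived[tid] = query          # resume with the original tracking ID
            elif age + 1 >= self.patience:
                expired.append(tid)           # drop after P consecutive inactive frames
            else:
                self.pool[tid] = (query, age + 1)
        for tid in list(revived) + expired:
            self.pool.pop(tid, None)
        return revived
```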

3.6 Set prediction loss

As shown in the model architecture (Figure 2), PatchTrack processes a frame pair $f_{k-1}$ and $f_k$ iteratively, and there are two steps involved. The first step is performing object detection on $f_{k-1}$ in order to initialize the track queries used when processing $f_k$. The second step is performing object tracking on $f_k$ using the previously generated track queries. Since the second step involves detecting new objects, which is the same as the first step, as well as tracking existing objects with the tracking IDs associated with track queries, we use two set prediction losses [5], one for detecting new objects and the other for tracking objects that exist in $f_{k-1}$.

Let us denote $T_{k-1}$ and $T_k$ as the tracks for $f_{k-1}$ and $f_k$ respectively. In the case of detecting new objects, we are looking at any track $t\in T_k\setminus T_{k-1}$, which corresponds to new objects in $f_k$ but not $f_{k-1}$. We adopt an object detection set prediction loss following the matching cost in TransTrack [35] and DETR [5]:

$\mathcal{L}_{det}=\lambda_{cls}\cdot\mathcal{L}_{det\_cls}+\lambda_{L1}\cdot\mathcal{L}_{det\_L1}+\lambda_{IoU}\cdot\mathcal{L}_{det\_IoU},$ (1)

where $\mathcal{L}_{det\_cls}$ is the focal loss [20] between the predicted class labels and the ground truth, $\mathcal{L}_{det\_L1}$ and $\mathcal{L}_{det\_IoU}$ are the L1 loss and generalized IoU loss [32] between the normalized centers and sides of the predicted bounding boxes and the ground truth, and $\lambda_{cls}$, $\lambda_{L1}$, and $\lambda_{IoU}$ are their respective weights. Predictions generated from decoding object queries are compared with the ground truth $t\in T_k\setminus T_{k-1}$, so $\mathcal{L}_{det}$ handles new object detection.

Similarly, our object tracking set prediction loss is as follows:

$\mathcal{L}_{trk}=\lambda_{cls}\cdot\mathcal{L}_{trk\_cls}+\lambda_{L1}\cdot\mathcal{L}_{trk\_L1}+\lambda_{IoU}\cdot\mathcal{L}_{trk\_IoU},$ (2)

where $\mathcal{L}_{trk\_cls}$, $\mathcal{L}_{trk\_L1}$, and $\mathcal{L}_{trk\_IoU}$ are calculated between the predictions generated from decoding patch-track queries and the ground truth $t\in T_k\cap T_{k-1}$, so $\mathcal{L}_{trk}$ handles tracking objects from $f_{k-1}$ and predicting their new locations in $f_k$.

Our final loss function is simply the sum of the object detection set prediction loss and the object tracking set prediction loss: $\mathcal{L}=\mathcal{L}_{det}+\mathcal{L}_{trk}$.
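The sketch below illustrates how such a weighted set prediction loss can be assembled from off-the-shelf focal, L1, and generalized IoU losses, assuming predictions have already been matched to ground-truth tracks (the Hungarian matching of DETR/TransTrack is omitted for brevity); the weights shown are placeholders rather than our exact values.

```python
# A hedged sketch of the weighted set prediction loss in Eqs. (1)-(2) for already-matched pairs.
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss, generalized_box_iou

def set_prediction_loss(pred_logits, pred_boxes, gt_labels, gt_boxes,
                        w_cls=2.0, w_l1=5.0, w_iou=2.0):
    # pred_logits: (N, num_classes); gt_labels: (N, num_classes) one-hot targets
    # pred_boxes, gt_boxes: (N, 4) in (x1, y1, x2, y2), normalized to [0, 1]
    loss_cls = sigmoid_focal_loss(pred_logits, gt_labels, reduction="mean")
    loss_l1 = F.l1_loss(pred_boxes, gt_boxes)
    loss_iou = (1.0 - torch.diag(generalized_box_iou(pred_boxes, gt_boxes))).mean()
    return w_cls * loss_cls + w_l1 * loss_l1 + w_iou * loss_iou

# L = L_det (object queries vs. new objects) + L_trk (patch-track queries vs.
# tracks shared between f_{k-1} and f_k), each computed with the function above.
```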

Dataset Method MOTA↑ IDF1↑ MT↑ ML↓ FP↓ FN↓ IDsw↓
MOT16 DeepSORT [44] 61.4 62.2 32.8 18.2 12,852 56,668 781
HTA [21] 62.4 64.2 37.5 12.1 19,071 47,839 1,619
VMaxx [41] 62.6 49.2 32.7 21.1 10,604 56,182 1,389
RAR16 [9] 63.0 63.8 39.9 22.1 13,663 53,248 482
TAP [55] 64.8 73.5 40.6 22.0 12,980 50,635 794
CNNMTT [23] 65.2 62.2 32.4 21.3 6,578 55,896 946
POI [49] 66.1 65.1 34.0 21.3 5,061 55,914 805
GSDT [42] 66.7 69.2 38.6 19.0 14,754 45,057 959
TubeTK [27] 66.9 62.2 39.0 18.1 11,544 47,502 1,236
LM_CNN [1] 67.4 61.2 38.2 19.2 10,109 48,435 931
Chain-Tracker [29] 67.6 57.2 32.9 23.1 8,934 48,305 1,897
KDNT(POI) [49] 68.2 60.0 41.0 19.0 11,479 45,605 933
FairMOT [51] 69.3 72.3 40.3 16.7 13,501 41,653 815
QuasiDense [28] 69.8 67.1 41.6 19.8 9,861 44,050 1,097
TraDeS [45] 70.1 64.7 37.3 20.0 8,091 45,210 1,144
LMP_p [37] 71.0 70.1 46.9 21.9 7,880 44,564 434
PatchTrack (Ours) 73.3 65.8 45.7 11.3 10,660 36,824 1,179
Table 1: Evaluation on the MOT16 test set. We evaluate recent MOT systems on the MOT16 test set under the private detection protocol. The method names are taken directly from the MOTChallenge leaderboard, where the names in parentheses refer to the corresponding publications. Metrics marked with ↑ mean higher numbers are preferable, while those marked with ↓ mean lower numbers are preferable. Numbers are marked in bold if they are the best in their respective metric columns. Our proposed PatchTrack achieves the best results in MOTA, ML, and FN.
(a) LMP_p MOT16-08 Frame 210
(b) POI MOT16-08 Frame 210
(c) PatchTrack MOT16-08 Frame 210
(d) LMP_p MOT16-08 Frame 420
(e) POI MOT16-08 Frame 420
(f) PatchTrack MOT16-08 Frame 420
Figure 3: Visualizations on the MOT16 test set. Visualizations on the MOT16 test set are taken from MOTChallenge. We add annotations in red to highlight challenging cases where LMP_p [37] and POI [49] fail to track. While both LMP_p (Figure 3(a)) and POI (Figure 3(b)) fail to track objects that are partially occluded, PatchTrack is still able to locate such objects (Figure 3(c)). Additionally, PatchTrack performs better at distinguishing different objects in a cluster (Figure 3(f)) without missing objects (Figure 3(e)) or tracking one object twice (Figure 3(d)).
Dataset (CNN-based) method MOTA↑ IDF1↑ MT↑ ML↓ FP↓ FN↓ IDsw↓
MOT17 DAN [36] 52.4 49.5 21.4 30.7 25,423 234,592 8,431
TubeTK [27] 63.0 58.6 31.2 19.9 27,060 177,483 4,137
GSDT [42] 66.2 63.4 36.9 21.7 25,800 164,120 2,711
Chained-Tracker [29] 66.6 57.4 37.8 18.5 22,284 160,491 5,529
CenterTrack [53] 67.8 64.7 34.6 24.6 18,498 160,332 3,039
QuasiDense [28] 68.7 66.3 40.6 21.9 26,589 146,643 3,378
TraDes [45] 69.1 63.9 36.4 21.5 20,892 150,060 3,555
MAT [14] 69.5 63.1 43.8 18.9 30,660 138,741 2,844
SOTMOT [52] 71.0 71.9 42.7 15.3 39,537 118,983 5,184
RADTrack (RelationTrack) [48] 73.1 73.7 39.9 20.0 25,935 122,700 3,021
GSDT [42] 73.2 66.5 41.7 17.5 26,397 120,666 3,891
Semi-TCL [18] 73.3 73.2 41.8 18.7 22,944 124,980 2,790
FairMOT [51] 73.7 72.3 43.2 17.3 27,507 117,477 3,303
RelationTrack [48] 73.8 74.7 41.7 23.2 27,999 118,623 1,374
PermaTrackPr [38] 73.8 68.9 43.8 17.2 28,998 115,104 3,699
CSTrack [19] 74.9 72.6 41.5 17.5 23,847 114,303 3,567
PatchTrack (ours) 73.6 65.2 44.6 12.5 23,976 121,230 3,795
Transformer-based method
MOTR [50] 65.1 66.4 33.0 25.2 45,486 149,307 2,049
TrackFormer [24] 65.0 63.9 45.6 13.8 70,443 123,552 3,528
MOTPrivate (TransCenter) [46] 70.0 62.1 38.9 20.4 28,119 136,722 4,647
TransCenter [46] 73.2 62.2 40.8 18.5 23,112 123,738 4,614
TrTrack (TransTrack) [35] 75.2 63.5 55.3 10.2 50,157 86,442 3,603
PatchTrack (ours) 73.6 65.2 44.6 12.5 23,976 121,230 3,795
Table 2: Evaluation on the MOT17 test set. We evaluate recent MOT systems on the MOT17 test set under the private detection protocol. Compared to CNN-based (non-Transformer-based) methods, PatchTrack performs best in MT and ML. We also compare our proposed method with Transformer-based MOT systems. Numbers are in bold if they are the best in their respective metric columns, and in blue if they are second best.
(a) TransTrack MOT17-07 Frame 402
(b) TransTrack MOT17-07 Frame 420
(c) TransTrack MOT17-07 Frame 438
(d) PatchTrack MOT17-07 Frame 402
(e) PatchTrack MOT17-07 Frame 420
(f) PatchTrack MOT17-07 Frame 438
Figure 4: Visualizations on the MOT17 test set. Compared to TransTrack [35], PatchTrack shows comparable performance while producing fewer than half the false positives, whereas TransTrack suffers from detecting one object multiple times (Figure 4(a)) and ID switches (Figure 4(c)) when trying to track fully occluded objects (Figure 4(b)).

4 Experiments

4.1 Datasets and metrics

MOT MOT benchmarks are among the most widely used multi-object tracking benchmarks. We perform experiments on two of the MOT benchmarks: MOT16 and MOT17 [26]. MOT16 consists of a training set of 7 videos (5,316 frames and 336,891 tracks) and a test set of 7 videos (5,919 frames and 564,228 tracks) with FPS ranging from 14 to 30. To evaluate the performance of the tracking mechanism independently of the detection accuracy, this benchmark also provides public detection from Faster R-CNN [31]. MOT17 consists of the same training set and test set as MOT16, but with additional public detection from DPM [10] and SDP [47]. Both MOT16 and MOT17 are annotated with full-body bounding boxes.

CrowdHuman CrowdHuman [33] is a pedestrian detection benchmark. It contains 15,000 training images and 4,370 validation images with a total of 470K objects. The annotations are also human full-body bounding boxes. This benchmark is often used for pre-training MOT systems.

Metrics MOT benchmarks [17, 26, 7] use metrics from CLEAR [2], which include Multiple Object Tracking Accuracy (MOTA), Identity F1 score (IDF1), Identity Switches (IDsw), False Positive (FP) and False Negative (FN) detections, as well as Mostly Tracked (MT) and Mostly Lost (ML) trajectories.
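For reference, the two headline metrics follow the standard CLEAR-MOT and ID-metric definitions, restated here for the reader (they are not specific to this paper):

$\mathrm{MOTA} = 1 - \frac{\sum_{k}(\mathrm{FN}_{k} + \mathrm{FP}_{k} + \mathrm{IDsw}_{k})}{\sum_{k}\mathrm{GT}_{k}}, \qquad \mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}},$

where $\mathrm{GT}_{k}$ is the number of ground-truth objects in frame $k$ and IDTP/IDFP/IDFN are identity-level true positives, false positives, and false negatives.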

4.2 Training data generation

Given the architecture of PatchTrack (Figure 2), we need two consecutive frames to train the model. Although we could simply take frame pairs and predict track candidates from the tracks of the previous frame using the Kalman filter [43], as shown in the architecture, the Kalman filter cannot provide high-quality predictions in the early stage of a track, when the lack of prior information leads to high uncertainty. This in turn degrades the performance of the decoder, since the patch queries no longer serve as good guesses of where existing tracks may be in the current frame.

To simulate the role of the Kalman filter [43] and generate track candidates for training, we propose the following augmentation strategy. Given a frame pair $f_{k-1}$ and $f_k$, we first randomly shift and reshape each track bounding box in frame $f_{k-1}$ within a pre-defined range. We ensure that the IoU between each augmented bounding box and the track bounding box in frame $f_k$ with the same tracking ID, if it exists, is at least 0.5. This aligns with the IoU threshold commonly used in detection association [44, 3, 51]. These augmented tracks serve as the track candidates for our system during training.
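A minimal sketch of this box augmentation is shown below, assuming a center/size box parameterization and illustrative jitter ranges; the exact shift and scale bounds used in our experiments may differ.

```python
# Jitter a previous-frame track box and keep the result only if it still overlaps the
# matching ground-truth box in f_k with IoU >= 0.5; the caller supplies an IoU function.
import random

def jitter_box(box, max_shift=0.1, max_scale=0.2):
    """box = (cx, cy, w, h); shift by a fraction of the size, rescale width/height."""
    cx, cy, w, h = box
    cx += random.uniform(-max_shift, max_shift) * w
    cy += random.uniform(-max_shift, max_shift) * h
    w *= 1.0 + random.uniform(-max_scale, max_scale)
    h *= 1.0 + random.uniform(-max_scale, max_scale)
    return (cx, cy, w, h)

def make_track_candidate(prev_box, gt_box_in_fk, iou_fn, min_iou=0.5, max_tries=20):
    """Resample the jitter until the augmented box overlaps the ground truth enough."""
    for _ in range(max_tries):
        candidate = jitter_box(prev_box)
        if gt_box_in_fk is None or iou_fn(candidate, gt_box_in_fk) >= min_iou:
            return candidate
    return prev_box   # fall back to the unaugmented box
```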

We also adopt the track augmentation strategy from TrackFormer [24], where we introduce false negatives by removing from the input some queries associated with tracks that exist in both $f_{k-1}$ and $f_k$. The objective of the system is then to detect the corresponding objects as new objects using object queries. On the other hand, we sample output embeddings (generated from performing object detection on $f_{k-1}$) that map to background bounding boxes. They are included in the track queries as false positives when performing object tracking on $f_k$. To obtain their corresponding patch queries, we take their respective bounding boxes and augment them in the same manner as in track candidate generation, ensuring that the IoU of each augmented bounding box with every ground-truth track in $f_k$ is below 0.5. For each patch-track query generated from the above procedure, our system should decode it to the background class.

Frame pairs are selected from two sources. The first is video data from the MOT benchmarks [26], where we take two frames within a certain temporal range of each other in the same video. This gives us more variety in terms of camera motion. The second is image data from CrowdHuman [33], where we augment a single image through random scaling and translation to obtain a frame pair. For each selected frame pair, we perform the aforementioned steps to generate track candidates and modify the ground truth to account for the false positives/negatives we insert manually. PatchTrack is optimized towards this modified ground truth during training.

4.3 Implementation details

A Kalman filter [43] with a constant velocity model is used to predict track candidates. PatchTrack uses ResNet-50 [15] pre-trained on ImageNet [8] as its CNN backbone and Deformable DETR [56] as the Transformer encoder-decoder framework. The number of object queries is set to 500. Inactive track queries are kept for 30 frames for track re-birth.

We adopt the training procedure from TransTrack [35] as follows. The optimizer is AdamW with $\beta_1=0.9$, $\beta_2=0.999$ and an initial learning rate of $2\mathrm{e}{-4}$. We use 8 NVIDIA Tesla V100 GPUs with a batch size of 16. PatchTrack is first pre-trained on CrowdHuman [33] for 150 epochs, with the learning rate dropped to $2\mathrm{e}{-5}$ after the first 100 epochs. Then, PatchTrack is trained on both CrowdHuman and MOT17 [26] for another 20 epochs. Lastly, it is evaluated on the MOT16 and MOT17 [26] test sets.
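For illustration, a hedged sketch of this optimizer and learning-rate schedule is shown below; `model` is a placeholder standing in for the full PatchTrack network.

```python
# AdamW with lr 2e-4 dropped to 2e-5 after 100 of the 150 pre-training epochs.
import torch

model = torch.nn.Linear(256, 4)   # placeholder module standing in for PatchTrack
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100], gamma=0.1)

for epoch in range(150):          # CrowdHuman pre-training stage
    # ... one epoch of training ...
    scheduler.step()              # lr becomes 2e-5 after epoch 100
```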

4.4 Results

MOT16 We compare PatchTrack with other MOT systems on the MOT16 [26] test set under the private detection protocol (Table 1), where PatchTrack achieves state-of-the-art results in MOTA, ML, and FN. Compared to LMP_p [37] and POI [49], which collectively achieve the best results in the remaining metrics, PatchTrack has significantly lower ML, showing overall better tracking performance. Figure 3 shows additional visual comparisons with LMP_p and POI, where PatchTrack is able to track partially occluded objects and better distinguish objects in crowds without missing objects or tracking one object multiple times.

MOT17 Table 2 shows quantitative results of PatchTrack along with other recent MOT systems on the MOT17 [26] test set under the private detection protocol. Compared to non-Transformer-based methods, PatchTrack reports the best numbers in MT and ML, showing superior ability in trajectory prediction. PatchTrack also performs comparably with other Transformer-based methods, achieving second-best results in most metrics. Compared to TransTrack [35], which has state-of-the-art results in MOTA, MT, ML, and FN, our system produces fewer than half the false positives. We provide additional visualizations of PatchTrack and TransTrack in Figure 4. While PatchTrack performs on par with TransTrack, our system avoids tracking one object multiple times or causing ID switches when a previously fully occluded object re-appears.

4.5 Ablation study

The ablation study is performed on the MOT17 [26] validation set. The original MOT17 training set is split into a new training set and a validation set, consisting of the first half and the second half of each training video respectively. After pre-training PatchTrack on CrowdHuman [33], the system is fine-tuned on both CrowdHuman and the new MOT17 training set and evaluated on the validation set.

Type of queries We evaluate the effect of various queries in Table 3. Removing only patch queries or only track queries means the other type is sent to the Transformer decoder along with object queries. Removing patch-track queries means that the decoder takes in object queries only and essentially behaves like an object detector; after obtaining detections for each frame, we use the Kalman filter [43] and the Hungarian algorithm [16] to associate them. In this case, the modified system falls into the tracking-by-detection paradigm. We see that both patch queries and track queries play an important role in the joint-detection-and-tracking setting. On the other hand, the performance of the tracking-by-detection version of our system is overall comparable with PatchTrack, but it produces more ID switches.

Method MOTA MT ML IDsw
w/o patch queries 71.4 165 42 214
w/o track queries 66.3 141 61 248
w/o patch-track queries 72.0 176 40 200
PatchTrack 72.1 176 40 192
Table 3: Ablation study on the type of query inputs. We send different types of query inputs to our system and evaluate their effects. The results suggest a positive effect of both patch queries and track queries. When the system does not use patch-track queries and behaves as an object detector, with the Kalman filter [43] and the Hungarian algorithm [16] used to associate the predicted detections, it produces more ID switches.

Source of frame patches We also evaluate patch queries generated from different sources. The previous bboxes patches come directly from cropping the current frame of interest using the bounding boxes of tracks in the previous frame. Alternatively, the previous frame patches are generated by cropping the previous frame with the bounding boxes of tracks in the previous frame. From Table 4, we see similar results when using patches from the previous frame compared to using track queries alone, meaning that patches from the previous frame contain similar information to track queries. On the other hand, patches generated from the current frame with the bounding boxes of tracks in the previous frame degrade the performance. We reason that this is because of the misalignment between the frame and the bounding boxes, which leads to less useful information in the patches.

Method MOTA MT ML IDsw
w/o patch query 71.4 165 42 214
previous bboxes 62.8 137 69 258
previous frame 71.4 165 42 214
PatchTrack 72.1 176 40 192
Table 4: Ablation study on source of frame patches. We test patch queries generated from different sources. When the patches come from cropping the current frame using the track bounding boxes from the previous frame (previous bboxes), the corresponding patch queries have a negative effect on the performance.

5 Conclusion

We present PatchTrack, a Transformer-based joint-detection-and-tracking system using frame patches. By generating patch queries from the current frame of interest and track predictions from a motion model, we obtain information about object motion and appearance that is tied to the current frame. This novel way of using visual information from the current frame complements the track queries derived from previous frames. By using both types of queries collectively, PatchTrack achieves competitive results on MOT benchmarks.

References

  • [1] Maryam Babaee, Zimu Li, and Gerhard Rigoll. A dual cnn–rnn for multiple people tracking. Neurocomputing, 368:69–83, 2019.
  • [2] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing, 2008:1–10, 2008.
  • [3] Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In 2016 IEEE international conference on image processing (ICIP), pages 3464–3468. IEEE, 2016.
  • [4] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6154–6162, 2018.
  • [5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
  • [6] Zhigang Dai, Bolun Cai, Yugeng Lin, and Junying Chen. Up-detr: Unsupervised pre-training for object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1601–1610, 2021.
  • [7] Patrick Dendorfer, Hamid Rezatofighi, Anton Milan, Javen Shi, Daniel Cremers, Ian Reid, Stefan Roth, Konrad Schindler, and Laura Leal-Taixé. Mot20: A benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003, 2020.
  • [8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • [9] Kuan Fang, Yu Xiang, Xiaocheng Li, and Silvio Savarese. Recurrent autoregressive networks for online multi-object tracking. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 466–475. IEEE, 2018.
  • [10] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence, 32(9):1627–1645, 2009.
  • [11] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
  • [12] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
  • [13] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
  • [14] Shoudong Han, Piao Huang, Hongwei Wang, En Yu, Donghaisheng Liu, Xiaofeng Pan, and Jun Zhao. Mat: Motion-aware multi-object tracking. arXiv preprint arXiv:2009.04794, 2020.
  • [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [16] István Kenesei, Robert M Vago, and Anna Fenyvesi. Hungarian. Routledge, 2002.
  • [17] Laura Leal-Taixé, Anton Milan, Ian Reid, Stefan Roth, and Konrad Schindler. Motchallenge 2015: Towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942, 2015.
  • [18] Wei Li, Yuanjun Xiong, Shuo Yang, Mingze Xu, Yongxin Wang, and Wei Xia. Semi-tcl: Semi-supervised track contrastive representation learning. arXiv preprint arXiv:2107.02396, 2021.
  • [19] Chao Liang, Zhipeng Zhang, Yi Lu, Xue Zhou, Bing Li, Xiyong Ye, and Jianxiao Zou. Rethinking the competition between detection and reid in multi-object tracking. arXiv preprint arXiv:2010.12138, 2020.
  • [20] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
  • [21] Xufeng Lin, Chang-Tsun Li, Victor Sanchez, and Carsten Maple. On the detection-to-track association for online multi-object tracking. Pattern Recognition Letters, 146:200–207, 2021.
  • [22] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
  • [23] Nima Mahmoudi, Seyed Mohammad Ahadi, and Mohammad Rahmati. Multi-target tracking using cnn-based features: Cnnmtt. Multimedia Tools and Applications, 78(6):7077–7096, 2019.
  • [24] Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, and Christoph Feichtenhofer. Trackformer: Multi-object tracking with transformers. arXiv preprint arXiv:2101.02702, 2021.
  • [25] Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang. Conditional detr for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3651–3660, 2021.
  • [26] Anton Milan, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler. Mot16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.
  • [27] Bo Pang, Yizhuo Li, Yifan Zhang, Muchen Li, and Cewu Lu. Tubetk: Adopting tubes to track multi-object in a one-step training model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6308–6318, 2020.
  • [28] Jiangmiao Pang, Linlu Qiu, Xia Li, Haofeng Chen, Qi Li, Trevor Darrell, and Fisher Yu. Quasi-dense similarity learning for multiple object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 164–173, 2021.
  • [29] Jinlong Peng, Changan Wang, Fangbin Wan, Yang Wu, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Yanwei Fu. Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In European Conference on Computer Vision, pages 145–161. Springer, 2020.
  • [30] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
  • [31] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28:91–99, 2015.
  • [32] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 658–666, 2019.
  • [33] Shuai Shao, Zijian Zhao, Boxun Li, Tete Xiao, Gang Yu, Xiangyu Zhang, and Jian Sun. Crowdhuman: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123, 2018.
  • [34] Bing Shuai, Andrew Berneshawi, Xinyu Li, Davide Modolo, and Joseph Tighe. Siammot: Siamese multi-object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12372–12382, 2021.
  • [35] Peize Sun, Yi Jiang, Rufeng Zhang, Enze Xie, Jinkun Cao, Xinting Hu, Tao Kong, Zehuan Yuan, Changhu Wang, and Ping Luo. Transtrack: Multiple-object tracking with transformer. arXiv preprint arXiv:2012.15460, 2020.
  • [36] ShiJie Sun, Naveed Akhtar, HuanSheng Song, Ajmal Mian, and Mubarak Shah. Deep affinity network for multiple object tracking. IEEE transactions on pattern analysis and machine intelligence, 43(1):104–119, 2019.
  • [37] Siyu Tang, Mykhaylo Andriluka, Bjoern Andres, and Bernt Schiele. Multiple people tracking by lifted multicut and person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3539–3548, 2017.
  • [38] Pavel Tokmakov, Jie Li, Wolfram Burgard, and Adrien Gaidon. Learning to track with object permanence. arXiv preprint arXiv:2103.14258, 2021.
  • [39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • [40] Balaji Veeramani, John W Raymond, and Pritam Chanda. Deepsort: deep convolutional networks for sorting haploid maize seeds. BMC bioinformatics, 19(9):1–9, 2018.
  • [41] Xingyu Wan, Jinjun Wang, Zhifeng Kong, Qing Zhao, and Shunming Deng. Multi-object tracking using online metric learning with long short-term memory. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 788–792. IEEE, 2018.
  • [42] Yongxin Wang, Kris Kitani, and Xinshuo Weng. Joint object detection and multi-object tracking with graph neural networks. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13708–13715. IEEE, 2021.
  • [43] Greg Welch, Gary Bishop, et al. An introduction to the kalman filter. 1995.
  • [44] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP), pages 3645–3649. IEEE, 2017.
  • [45] Jialian Wu, Jiale Cao, Liangchen Song, Yu Wang, Ming Yang, and Junsong Yuan. Track to detect and segment: An online multi-object tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12352–12361, 2021.
  • [46] Yihong Xu, Yutong Ban, Guillaume Delorme, Chuang Gan, Daniela Rus, and Xavier Alameda-Pineda. Transcenter: Transformers with dense queries for multiple-object tracking. arXiv preprint arXiv:2103.15145, 2021.
  • [47] Fan Yang, Wongun Choi, and Yuanqing Lin. Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2129–2137, 2016.
  • [48] En Yu, Zhuoling Li, Shoudong Han, and Hongwei Wang. Relationtrack: Relation-aware multiple object tracking with decoupled representation. arXiv preprint arXiv:2105.04322, 2021.
  • [49] Fengwei Yu, Wenbo Li, Quanquan Li, Yu Liu, Xiaohua Shi, and Junjie Yan. Poi: Multiple object tracking with high performance detection and appearance feature. In European Conference on Computer Vision, pages 36–42. Springer, 2016.
  • [50] Fangao Zeng, Bin Dong, Tiancai Wang, Cheng Chen, Xiangyu Zhang, and Yichen Wei. Motr: End-to-end multiple-object tracking with transformer. arXiv preprint arXiv:2105.03247, 2021.
  • [51] Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, and Wenyu Liu. Fairmot: On the fairness of detection and re-identification in multiple object tracking. arXiv preprint arXiv:2004.01888, 2020.
  • [52] Linyu Zheng, Ming Tang, Yingying Chen, Guibo Zhu, Jinqiao Wang, and Hanqing Lu. Improving multiple object tracking with single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2453–2462, 2021.
  • [53] Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Tracking objects as points. In European Conference on Computer Vision, pages 474–490. Springer, 2020.
  • [54] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
  • [55] Zongwei Zhou, Junliang Xing, Mengdan Zhang, and Weiming Hu. Online multi-target tracking with tensor-based high-order graph matching. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 1809–1814. IEEE, 2018.
  • [56] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.