PatchTrack: Multiple Object Tracking Using Frame Patches

Xiaotong Chen
Computer Science
UC, Santa Barbara
[email protected]
This work was done during an internship at Appen
   Seyed Mehdi Iranmanesh
Amazon
[email protected]
This work was done while at Appen
   Kuo-Chin Lien
Appen
[email protected]
Abstract

Object motion and object appearance are commonly used sources of information in multiple object tracking (MOT) applications, either for associating detections across frames in tracking-by-detection methods or for direct track prediction in joint-detection-and-tracking methods. However, not only are these two types of information often considered separately, but they also do not directly help optimize the usage of visual information from the current frame of interest. In this paper, we present PatchTrack, a Transformer-based joint-detection-and-tracking system that predicts tracks using patches of the current frame of interest. We use the Kalman filter to predict the locations of existing tracks in the current frame from the previous frame. Patches cropped from the predicted bounding boxes are sent to the Transformer decoder to infer new tracks. By utilizing both the object motion and object appearance information encoded in patches, the proposed method pays more attention to where new tracks are more likely to occur. We show the effectiveness of PatchTrack on recent MOT benchmarks, including MOT16 (MOTA 73.71%, IDF1 65.77%) and MOT17 (MOTA 73.59%, IDF1 65.23%). The results are published at https://motchallenge.net/method/MOT=4725&chl=10.

(a) Tracking using track queries (output embeddings of the previous frame).
(b) Detection pre-trained to detect image patches.
(c) Tracking using patch queries and frame patches.
Figure 1: Inspiration for PatchTrack. MOT methods [35, 24] use output embeddings of the previous frame as track queries (colors represent tracking IDs) to propagate existing objects, and object queries to detect new objects entering the camera view (1(a)). The detection model [6] is pre-trained to locate image patches by adding their features to object queries (1(b)). The proposed system uses both track queries and features of frame patches predicted by a motion model to predict the locations of the corresponding objects (1(c)).

1 Introduction

Multiple object tracking (MOT) concerns identifying objects of interest and tracking their moving trajectories in video sequences. Intuitively, successful MOT algorithms need to be able to handle subtle appearance differences between multiple tracked objects and resolve the ambiguity via other cues, such as motion, when the targets are visually indistinguishable.

With the powerful appearance encoding capability of CNNs, the tracking-by-detection paradigm has dominated MOT methods over the past decade [5, 51, 44]. Highly accurate CNN-based object detection [30, 31, 4] is first performed in all frames independently, and then the detected objects are associated across frames to establish tracks with consistent object IDs. In the association step, the locations of existing tracks in the following frame may be predicted from assumptions (constant velocity, constant acceleration, etc.) or other motion models [51, 34, 43, 44] and then associated with detections based on metrics such as intersection-over-union (IoU).

Joint-detection-and-tracking methods [53, 35, 45] have recently demonstrated superior accuracy. The idea is to perform object detection and tracking simultaneously so that each task benefits from information shared by the other. This is particularly intriguing in Transformer-based architectures, where output feature embeddings of previous frames are used as 'track queries', along with 'object queries', for the Transformer decoder, which predicts the corresponding tracks as well as newly discovered objects in the current frame (Figure 1(a)). Albeit achieving state-of-the-art MOT results, we argue that these architectures rely too heavily on appearance information from previous frames. As the information encoded in track queries is strictly limited to previous frames, the Transformer model needs to infer both the object offset and the object appearance in the current frame.

To resolve the above problem, we take inspiration from UP-DETR [6], an object detection model that is pre-trained to detect image patches (Figure 1(b)) using patch features, and propose an MOT system that uses frame patches from the current frame of interest. We first use a motion model to predict the new locations of existing tracks in the current frame from the previous frame, and crop the current frame into patches based on these predictions. These patches, carrying implicit prior knowledge of object motion and explicit information about object appearance in the current frame, are sent to the decoder to predict the new locations of existing tracks in the current frame.

More specifically, we present PatchTrack (Figure 1(c)), a Transformer-based joint-object-detection-and-tracking system that predicts tracks in the current frame of interest from its patches. We use the Kalman filter [43] to obtain track candidates in the current frame from existing tracks in the previous frame, and crop the current frame using the bounding boxes of these candidates to get patches. Both the current frame and these patches are sent to our convolutional neural network (CNN) [13] backbone, which outputs the frame feature and the patch queries respectively. Each track query, taken from the output embeddings produced when processing the previous frame, is added to the patch query with the same tracking ID to form the corresponding patch-track query. These patch-track queries are sent to the decoder along with object queries, where the former are used to predict the new locations of existing tracks and the latter are used to detect new objects in the current frame.

We evaluate PatchTrack on MOT benchmarks and achieve competitive results on MOT16 (MOTA 73.71%, IDF1 65.77%) and MOT17 (MOTA 73.59%, IDF1 65.23%) test sets. To the best of our knowledge, our method is the first that uses patches of the current frame of interest to infer both object motion and appearance information simultaneously. We hope it could provide a new perspective for designing MOT systems.

In summary, our contributions are:

  • A Transformer-based MOT system, namely PatchTrack, which jointly performs object detection and tracking.

  • A novel way of optimizing the usage of visual information by utilizing patches from the current frame of interest.

  • Introduction of patch-track queries that incorporate both knowledge of the object motion and object appearance in the current frame of interest to facilitate tracking.

2 Related Work

2.1 Object detection and tracking

Object detection concerns locating and/or classifying objects of interest in a single image. As the prerequisite to object tracking, the two are closely connected. Many popular object detection methods generate detections from hypotheses of object locations, including region proposals [12, 11, 31, 4] and anchors/object centers [30, 22, 54]. On the other hand, an increasing number of object tracking systems utilize the Transformer [39], which has previously shown success in object detection [5, 25, 56, 6]. Transformer-based object detection methods encode CNN [13] features of images and decode learned object queries to obtain detections. Aside from architecture adjustments [25, 56] to the original DETR [5], we also see modifications to object queries [6] using image patches to facilitate detection. Inspired by the usage of region proposals and image patches, our proposed method uses frame patches, which can be considered our initial guess of track locations and appearance.

2.2 Tracking-by-detection

One major paradigm in MOT is tracking-by-detection, where the MOT systems [5, 51, 44] first obtain detections for each frame and then associate them across frames to form tracks. Since object detection is a standalone step in the tracking process, one benefit of tracking-by-detection methods is the flexibility to pair different object detection models [31, 30, 5] with different association strategies, thereby benefiting directly from advances in object detection. On the other hand, the object detection step omits information across frames, as each frame is processed separately by the detector.

Object motion and appearance may only be considered as part of the detection association strategy in these methods [51, 34]. For object motion, the Kalman filter [43] is one of the most popular algorithms used to propagate detections from the previous frame and predict their locations in the following frame. Combined with the Hungarian algorithm [16] and intersection-over-union (IoU) metrics, it has proven to be an effective tracking mechanism [3]. Object appearance information such as Re-ID features [44, 28, 51] is also commonly used as a similarity measure.
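To make this association step concrete, the following is a minimal sketch of IoU-based matching with the Hungarian algorithm in the spirit of SORT [3]; the function names and the threshold value are illustrative and not taken from any of the cited systems.

```python
# A minimal sketch of IoU-based detection association with the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted_tracks, detections, iou_threshold=0.5):
    """Match motion-model predictions to detections by maximizing total IoU."""
    cost = np.zeros((len(predicted_tracks), len(detections)))
    for i, t in enumerate(predicted_tracks):
        for j, d in enumerate(detections):
            cost[i, j] = -iou(t, d)          # negate: the Hungarian solver minimizes cost
    rows, cols = linear_sum_assignment(cost)
    matches = [(i, j) for i, j in zip(rows, cols) if -cost[i, j] >= iou_threshold]
    return matches                            # unmatched detections would start new tracks
```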

2.3 Joint-detection-and-tracking

The other popular paradigm in MOT is joint-detection-and-tracking, where object detection and object tracking are performed simultaneously [53, 35]. One advantage of joint-detection-and-tracking methods is access to information across frames. For instance, features of multiple frames can be used at once [53, 35, 45] to facilitate detection and/or tracking. For Transformer-based joint-detection-and-tracking methods, both the encoder and the decoder may take additional information from previous frames to infer predictions for the current frame of interest [35, 24, 50]. Specifically, recent works have introduced track queries [35, 24], which come from the output embeddings produced when processing previous frames. Depending on the design, the track queries may be decoded to bounding boxes separately from the object queries [35] and matched together to predict new tracks, or processed together to form new tracks directly [24].

Figure 2: PatchTrack. We first use the Kalman filter [43] to predict track candidates in frame $f_k$ from the tracks in frame $f_{k-1}$. Both frames are sent to the CNN backbone, which produces frame features for the Transformer encoder. We crop $f_k$ into patches using the bounding boxes of the track candidates and send them to the CNN backbone, followed by a fully connected layer (FC) and global average pooling (GAP), to get patch queries that align with the track queries. Patch queries are added to track queries to form patch-track queries, which are then sent to the Transformer decoder along with object queries. The patch-track queries are decoded to output embeddings that refine the locations of the track candidates, and the object queries are decoded to output embeddings that detect new objects. Output embeddings that correspond to tracks in $f_k$ become the track queries for processing $f_{k+1}$.

3 Method

In this section, we describe the architecture of PatchTrack (Section 3.1), how object tracking is initialized (Section 3.2), how existing tracks are propagated to form new track candidates (Section 3.3), and how frame patches are generated to facilitate object tracking (Section 3.4).

3.1 Architecture

PatchTrack is a Transformer-based joint-detection-and-tracking system. The Transformer encoder takes in the CNN features of a consecutive frame pair. The Transformer decoder takes queries as input and outputs bounding boxes. PatchTrack deals with four types of queries: object queries, track queries, patch queries, and patch-track queries. Depending on the source of the queries, the predicted bounding boxes may correspond either to tracks associated with existing tracking IDs or to detections that need to be assigned new tracking IDs.

3.2 Object tracking initialization

Object tracking for the first frame $f_1$ is equivalent to object detection, where each predicted detection can be arbitrarily assigned a unique tracking ID to form a track. Frame $f_1$ is sent to the CNN backbone, which outputs the corresponding frame feature. This feature is stacked with itself [35] and sent to the Transformer encoder. Since there are no existing tracks to form non-object queries, the Transformer decoder only takes object queries as input and produces embeddings. The output embeddings that result in non-background bounding boxes are the predicted detections in $f_1$, each of which is assigned a unique tracking ID to form a track. These embeddings are also used as the track queries for the next frame.

3.3 Track propagation

For frame $f_k$ ($k>1$), there exists $f_{k-1}$ with a set of tracks $T_{k-1}$. We can propagate these tracks using a motion model and infer tracks in $f_k$ (Algorithm 1).

Here we use the Kalman filter [43] as our motion model to predict a set of track candidates for $f_k$, denoted $\widehat{T}_{k}$. We call them track candidates because there are several problems with using them directly as tracks in $f_k$. First, since the tracks in $\widehat{T}_{k}$ are mapped one-to-one to the ones in $T_{k-1}$, they only include objects that have already appeared in $f_{k-1}$. Second, although the Kalman filter and other motion models have shown effectiveness in many cases [3, 40, 51], their predicted bounding boxes are not accurate enough at locating objects. This is why motion models are typically used only to process existing tracks, with IoU introduced to match the processed tracks against new detections to form new tracks. In the joint-object-detection-and-tracking paradigm, our architecture is instead designed to refine these track candidates into more accurate tracks.

Input: Tracks $T_{k-1}$ in frame $f_{k-1}$; motion model $\texttt{M}$
Output: Track candidates $\widehat{T}_{k}$ for frame $f_{k}$
Initialization: $\widehat{T}_{k}\leftarrow\emptyset$
for $t\in T_{k-1}$ do
      $\widehat{T}_{k}\leftarrow\widehat{T}_{k}\cup\{\texttt{M}(t)\}$
end for
Algorithm 1: Pseudo-code for object propagation
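The following is a minimal sketch of Algorithm 1 with a constant-velocity prediction standing in for the full Kalman filter; the Track structure and its field names are illustrative assumptions, not the exact implementation.

```python
# A minimal sketch of track propagation: each track keeps a box and a velocity
# estimate, and the prediction step advances the box by one frame.
from dataclasses import dataclass, field

@dataclass
class Track:
    track_id: int
    box: list                                   # (cx, cy, w, h) center/size parameterization
    velocity: list = field(default_factory=lambda: [0.0, 0.0, 0.0, 0.0])

def propagate(tracks_prev):
    """Predict track candidates for frame k from tracks in frame k-1 (Algorithm 1)."""
    candidates = []
    for t in tracks_prev:
        predicted_box = [b + v for b, v in zip(t.box, t.velocity)]   # constant-velocity step
        candidates.append(Track(t.track_id, predicted_box, t.velocity))
    return candidates
```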

3.4 Patch generation and object tracking

To tackle the above problems, we take inspiration from UP-DETR [6], whose Transformer decoder is pre-trained to detect the locations of random image patches using their corresponding CNN features. Our proposed PatchTrack takes patches of frame $f_k$ as additional visual information, beyond the entire $f_k$, to perform object tracking. Specifically, for each track candidate $\widehat{t}\in\widehat{T}_{k}$, we crop the frame using its bounding box and send the resulting patch to the CNN backbone to get the corresponding patch feature. We use a fully-connected (FC) layer followed by global average pooling (GAP) to process all patch features into patch queries that align with the track queries (Figure 2). Each patch query is added to the track query with the same tracking ID to form a patch-track query. The patch-track queries are sent to the Transformer decoder along with the initial object queries, and both are processed jointly. The output embedding decoded from each patch-track query may correspond either to the refined location of the corresponding track candidate, or to the background if the object has left $f_k$. On the other hand, the embeddings decoded from object queries that result in non-background detections locate new objects entering $f_k$, which are assigned new tracking IDs to form new tracks. All embeddings that contribute to tracks in $f_k$ form the track queries for $f_{k+1}$ (Figure 2).
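Below is a hedged PyTorch sketch of this patch-query pipeline (crop, shared backbone, FC followed by GAP, and addition to the matching track query); the module names, feature dimensions, and the stand-in backbone are assumptions for illustration rather than the exact implementation.

```python
import torch
import torch.nn as nn

class PatchQueryEncoder(nn.Module):
    """Turns cropped frame patches into patch queries (Figure 2): backbone -> FC -> GAP."""
    def __init__(self, backbone=None, backbone_channels=64, hidden_dim=256):
        super().__init__()
        # Stand-in backbone; in the paper this would be the shared ResNet-50 trunk.
        self.backbone = backbone or nn.Conv2d(3, backbone_channels, 3, stride=2, padding=1)
        self.fc = nn.Linear(backbone_channels, hidden_dim)   # FC applied on the channel dimension
        self.gap = nn.AdaptiveAvgPool2d(1)                   # GAP over the spatial dimensions

    def forward(self, frame, candidate_boxes):
        # frame: (3, H, W) tensor; candidate_boxes: list of (x1, y1, x2, y2) pixel coordinates
        queries = []
        for x1, y1, x2, y2 in candidate_boxes:
            patch = frame[:, y1:y2, x1:x2].unsqueeze(0)       # crop f_k with the candidate box
            feat = self.backbone(patch)                       # (1, C, h, w) patch feature
            feat = self.fc(feat.permute(0, 2, 3, 1))          # project channels: (1, h, w, D)
            queries.append(self.gap(feat.permute(0, 3, 1, 2)).flatten(1))
        return torch.cat(queries, dim=0)                      # (num_candidates, hidden_dim)

# Patch-track queries are the element-wise sum of patch queries and the track
# queries carrying the same tracking IDs:
#   patch_track_queries = patch_queries + track_queries
```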

3.5 Track re-birth

To obtain the track queries for frame $f_{k+1}$ from the track queries for $f_k$, embeddings corresponding to new detections are added and track queries corresponding to the background class are removed (Figure 2). A problem with this mechanism is that it is not robust for long-range tracking: if an object is not successfully detected, it can only be assigned a new tracking ID when it is detected again, which causes fragmented trajectories. To tackle this problem, we adopt the track re-identification strategy from TrackFormer [24] and store the otherwise-removed patch-track queries in an inactive query set. Queries in this set are included in the list of patch-track queries sent to the decoder for at most $P$ consecutive frames. If such a query is decoded to a non-background bounding box during this period, it is re-activated with its original tracking ID; otherwise it is removed.
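A minimal sketch of this inactive-query bookkeeping is shown below; the class and method names are ours, and only the patience-$P$ behavior described above is modeled.

```python
# A minimal sketch of track re-birth: queries of tracks that fall to background are
# kept "inactive" for up to P frames and re-activated if they decode to a
# non-background box again; otherwise they expire.
class InactiveQueryPool:
    def __init__(self, patience=30):
        self.patience = patience
        self.pool = {}                        # track_id -> (query, frames_inactive)

    def deactivate(self, track_id, query):
        self.pool[track_id] = (query, 0)

    def step(self, reactivated_ids):
        """Call once per frame with the IDs that decoded to non-background boxes."""
        revived, expired = {}, []
        for tid, (query, age) in self.pool.items():
            if tid in reactivated_ids:
                revived[tid] = query          # resume with the original tracking ID
            elif age + 1 >= self.patience:
                expired.append(tid)           # drop after P consecutive inactive frames
            else:
                self.pool[tid] = (query, age + 1)
        for tid in list(revived) + expired:
            self.pool.pop(tid, None)
        return revived
```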

3.6 Set prediction loss

As shown in the model architecture (Figure 2), PatchTrack processes a frame pair $f_{k-1}$ and $f_k$ iteratively, and there are two steps involved. The first step is performing object detection on $f_{k-1}$ in order to initialize the track queries used when processing $f_k$. The second step is performing object tracking on $f_k$ using the previously generated track queries. Since the second step involves detecting new objects, which is the same as the first step, as well as tracking existing objects with the tracking IDs associated with track queries, we use two set prediction losses [5], one for detecting new objects and the other for tracking objects that exist in $f_{k-1}$.

Let us denote $T_{k-1}$ and $T_k$ as the tracks for $f_{k-1}$ and $f_k$ respectively. In the case of detecting new objects, we are looking at any track $t\in T_k\setminus T_{k-1}$, which corresponds to new objects in $f_k$ but not $f_{k-1}$. We adopt an object detection set prediction loss following the matching cost in TransTrack [35] and DETR [5]:

$\mathcal{L}_{det}=\lambda_{cls}\cdot\mathcal{L}_{det\_cls}+\lambda_{L1}\cdot\mathcal{L}_{det\_L1}+\lambda_{IoU}\cdot\mathcal{L}_{det\_IoU},$ (1)

where $\mathcal{L}_{det\_cls}$ is the focal loss [20] between the predicted class labels and the ground truth, $\mathcal{L}_{det\_L1}$ and $\mathcal{L}_{det\_IoU}$ are the L1 loss and generalized IoU loss [32] between the normalized centers and sides of the predicted bounding boxes and the ground truth, and $\lambda_{cls}$, $\lambda_{L1}$, and $\lambda_{IoU}$ are their respective weights. Predictions generated from decoding object queries are compared with the ground truth $t\in T_k\setminus T_{k-1}$, so $\mathcal{L}_{det}$ handles new object detection.

Similarly, our object tracking set prediction loss is as follows:

$\mathcal{L}_{trk}=\lambda_{cls}\cdot\mathcal{L}_{trk\_cls}+\lambda_{L1}\cdot\mathcal{L}_{trk\_L1}+\lambda_{IoU}\cdot\mathcal{L}_{trk\_IoU},$ (2)

where $\mathcal{L}_{trk\_cls}$, $\mathcal{L}_{trk\_L1}$, and $\mathcal{L}_{trk\_IoU}$ are calculated between the predictions generated from decoding patch-track queries and the ground truth $t\in T_k\cap T_{k-1}$, so $\mathcal{L}_{trk}$ handles tracking objects from $f_{k-1}$ and predicting their new locations in $f_k$.

Our final loss function is simply the sum of the object detection set prediction loss and the object tracking set prediction loss: $\mathcal{L}=\mathcal{L}_{det}+\mathcal{L}_{trk}$.
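The sketch below illustrates how such a weighted set prediction loss can be assembled from off-the-shelf focal, L1, and generalized IoU losses, assuming predictions have already been matched to ground-truth tracks (the Hungarian matching of DETR/TransTrack is omitted for brevity); the weights shown are placeholders rather than our exact values.

```python
# A hedged sketch of the weighted set prediction loss in Eqs. (1)-(2) for already-matched pairs.
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss, generalized_box_iou

def set_prediction_loss(pred_logits, pred_boxes, gt_labels, gt_boxes,
                        w_cls=2.0, w_l1=5.0, w_iou=2.0):
    # pred_logits: (N, num_classes); gt_labels: (N, num_classes) one-hot targets
    # pred_boxes, gt_boxes: (N, 4) in (x1, y1, x2, y2), normalized to [0, 1]
    loss_cls = sigmoid_focal_loss(pred_logits, gt_labels, reduction="mean")
    loss_l1 = F.l1_loss(pred_boxes, gt_boxes)
    loss_iou = (1.0 - torch.diag(generalized_box_iou(pred_boxes, gt_boxes))).mean()
    return w_cls * loss_cls + w_l1 * loss_l1 + w_iou * loss_iou

# L = L_det (object queries vs. new objects) + L_trk (patch-track queries vs.
# tracks shared between f_{k-1} and f_k), each computed with the function above.
```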

Dataset Method MOTA↑ IDF1↑ MT↑ ML↓ FP↓ FN↓ IDsw↓
MOT16 DeepSORT [44] 61.4 62.2 32.8 18.2 12,852 56,668 781
HTA [21] 62.4 64.2 37.5 12.1 19,071 47,839 1,619
VMaxx [41] 62.6 49.2 32.7 21.1 10,604 56,182 1,389
RAR16 [9] 63.0 63.8 39.9 22.1 13,663 53,248 482
TAP [55] 64.8 73.5 40.6 22.0 12,980 50,635 794
CNNMTT [23] 65.2 62.2 32.4 21.3 6,578 55,896 946
POI [49] 66.1 65.1 34.0 21.3 5,061 55,914 805
GSDT [42] 66.7 69.2 38.6 19.0 14,754 45,057 959
TubeTK [27] 66.9 62.2 39.0 18.1 11,544 47,502 1,236
LM_CNN [1] 67.4 61.2 38.2 19.2 10,109 48,435 931
Chain-Tracker [29] 67.6 57.2 32.9 23.1 8,934 48,305 1,897
KDNT(POI) [49] 68.2 60.0 41.0 19.0 11,479 45,605 933
FairMOT [51] 69.3 72.3 40.3 16.7 13,501 41,653 815
QuasiDense [28] 69.8 67.1 41.6 19.8 9,861 44,050 1,097
TraDeS [45] 70.1 64.7 37.3 20.0 8,091 45,210 1,144
LMP_p [37] 71.0 70.1 46.9 21.9 7,880 44,564 434
PatchTrack (Ours) 73.3 65.8 45.7 11.3 10,660 36,824 1,179
Table 1: Evaluation on the MOT16 test set. We evaluate recent MOT systems on the MOT16 test set under the private detection protocol. The method names are taken directly from the MOTChallenge leaderboard, where the names in parentheses refer to the corresponding publications. Metrics marked with ↑ mean higher numbers are preferable, while those marked with ↓ mean lower numbers are preferable. Numbers are marked in bold if they are the best in their respective metric columns. Our proposed PatchTrack achieves the best results in MOTA, ML, and FN.
(a) LMP_p MOT16-08 Frame 210
(b) POI MOT16-08 Frame 210
(c) PatchTrack MOT16-08 Frame 210
(d) LMP_p MOT16-08 Frame 420
(e) POI MOT16-08 Frame 420
(f) PatchTrack MOT16-08 Frame 420
Figure 3: Visualizations on the MOT16 test set. Visualizations on the MOT16 test set are taken from MOTChallenge. We add annotations in red to highlight challenging cases where LMP_p [37] and POI [49] fail to track. While both LMP_p (Figure 3(a)) and POI (Figure 3(b)) fail to track objects that are partially occluded, PatchTrack is still able to locate such objects (Figure 3(c)). Additionally, PatchTrack performs better at distinguishing different objects in a cluster (Figure 3(f)) without missing objects (Figure 3(e)) or tracking one object twice (Figure 3(d)).
Dataset (CNN-based) method MOTA↑ IDF1↑ MT↑ ML↓ FP↓ FN↓ IDsw↓
MOT17 DAN [36] 52.4 49.5 21.4 30.7 25,423 234,592 8,431
TubeTK [27] 63.0 58.6 31.2 19.9 27,060 177,483 4,137
GSDT [42] 66.2 63.4 36.9 21.7 25,800 164,120 2,711
Chained-Tracker [29] 66.6 57.4 37.8 18.5 22,284 160,491 5,529
CenterTrack [53] 67.8 64.7 34.6 24.6 18,498 160,332 3,039
QuasiDense [28] 68.7 66.3 40.6 21.9 26,589 146,643 3,378
TraDes [45] 69.1 63.9 36.4 21.5 20,892 150,060 3,555
MAT [14] 69.5 63.1 43.8 18.9 30,660 138,741 2,844
SOTMOT [52] 71.0 71.9 42.7 15.3 39,537 118,983 5,184
RADTrack (RelationTrack) [48] 73.1 73.7 39.9 20.0 25,935 122,700 3,021
GSDT [42] 73.2 66.5 41.7 17.5 26,397 120,666 3,891
Semi-TCL [18] 73.3 73.2 41.8 18.7 22,944 124,980 2,790
FairMOT [51] 73.7 72.3 43.2 17.3 27,507 117,477 3,303
RelationTrack [48] 73.8 74.7 41.7 23.2 27,999 118,623 1,374
PermaTrackPr [38] 73.8 68.9 43.8 17.2 28,998 115,104 3,699
CSTrack [19] 74.9 72.6 41.5 17.5 23,847 114,303 3,567
PatchTrack (ours) 73.6 65.2 44.6 12.5 23,976 121,230 3,795
Transformer-based method
MOTR [50] 65.1 66.4 33.0 25.2 45,486 149,307 2,049
TrackFormer [24] 65.0 63.9 45.6 13.8 70,443 123,552 3,528
MOTPrivate (TransCenter) [46] 70.0 62.1 38.9 20.4 28,119 136,722 4,647
TransCenter [46] 73.2 62.2 40.8 18.5 23,112 123,738 4,614
TrTrack (TransTrack) [35] 75.2 63.5 55.3 10.2 50,157 86,442 3,603
PatchTrack (ours) 73.6 65.2 44.6 12.5 23,976 121,230 3,795
Table 2: Evaluation on the MOT17 test set. We evaluate recent MOT systems on the MOT17 test set under the private detection protocol. Compared to CNN-based (non-Transformer-based) methods, PatchTrack performs best in MT and ML. We also compare our proposed method with Transformer-based MOT systems. Numbers are in bold if they are the best in their respective metric columns, and in blue if they are second best.
(a) TransTrack MOT17-07 Frame 402
(b) TransTrack MOT17-07 Frame 420
(c) TransTrack MOT17-07 Frame 438
(d) PatchTrack MOT17-07 Frame 402
(e) PatchTrack MOT17-07 Frame 420
(f) PatchTrack MOT17-07 Frame 438
Figure 4: Visualizations on the MOT17 test set. Compared to TransTrack [35], PatchTrack shows comparable performance while producing fewer than half the false positives, whereas TransTrack suffers from detecting one object multiple times (Figure 4(a)) and ID switches (Figure 4(c)) when trying to track fully occluded objects (Figure 4(b)).

4 Experiments

4.1 Datasets and metrics

MOT MOT benchmarks are among the most widely used multi-object tracking benchmarks. We perform experiments on two of the MOT benchmarks: MOT16 and MOT17 [26]. MOT16 consists of a training set of 7 videos (5,316 frames and 336,891 tracks) and a test set of 7 videos (5,919 frames and 564,228 tracks) with FPS ranging from 14 to 30. To evaluate the performance of the tracking mechanism independently of the detection accuracy, this benchmark also provides public detection from Faster R-CNN [31]. MOT17 consists of the same training set and test set as MOT16, but with additional public detection from DPM [10] and SDP [47]. Both MOT16 and MOT17 are annotated with full-body bounding boxes.

CrowdHuman CrowdHuman [33] is a pedestrian detection benchmark. It contains 15,000 training images and 4,370 validation images with a total of 470K objects. The annotations are also human full-body bounding boxes. This benchmark is often used for pre-training MOT systems.

Metrics MOT benchmarks [17, 26, 7] use metrics from CLEAR [2], which include Multiple Object Tracking Accuracy (MOTA), Identity F1 score (IDF1), Identity Switches (IDsw), False Positive (FP) and False Negative (FN) detections, as well as Mostly Tracked (MT) and Mostly Lost (ML) trajectories.
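For reference, the two headline metrics follow the standard CLEAR-MOT and ID-metric definitions, restated here for the reader (they are not specific to this paper):

$\mathrm{MOTA} = 1 - \frac{\sum_{k}(\mathrm{FN}_{k} + \mathrm{FP}_{k} + \mathrm{IDsw}_{k})}{\sum_{k}\mathrm{GT}_{k}}, \qquad \mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}},$

where $\mathrm{GT}_{k}$ is the number of ground-truth objects in frame $k$ and IDTP/IDFP/IDFN are identity-level true positives, false positives, and false negatives.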

4.2 Training data generation

Given the architecture of PatchTrack (Figure 2), we need two consecutive frames to train the model. Although we could simply take frame pairs and predict track candidates from the tracks of the previous frame using the Kalman filter [43], as shown in the architecture, the Kalman filter cannot provide high-quality predictions in the early stage of a track, when the lack of prior information leads to high uncertainty. This in turn degrades the performance of the decoder, since the patch queries no longer serve as good guesses of where existing tracks may be in the current frame.

To simulate the role of the Kalman filter [43] and generate track candidates for training, we propose the following augmentation strategy. Given a frame pair $f_{k-1}$ and $f_k$, we first randomly shift and reshape each track bounding box in frame $f_{k-1}$ within a pre-defined range. We ensure that the IoU between each augmented bounding box and the track bounding box in frame $f_k$ with the same tracking ID, if it exists, is at least 0.5. This aligns with the IoU threshold commonly used in detection association [44, 3, 51]. These augmented tracks serve as the track candidates for our system during training.
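A minimal sketch of this box augmentation is shown below, assuming a center/size box parameterization and illustrative jitter ranges; the exact shift and scale bounds used in our experiments may differ.

```python
# Jitter a previous-frame track box and keep the result only if it still overlaps the
# matching ground-truth box in f_k with IoU >= 0.5; the caller supplies an IoU function.
import random

def jitter_box(box, max_shift=0.1, max_scale=0.2):
    """box = (cx, cy, w, h); shift by a fraction of the size, rescale width/height."""
    cx, cy, w, h = box
    cx += random.uniform(-max_shift, max_shift) * w
    cy += random.uniform(-max_shift, max_shift) * h
    w *= 1.0 + random.uniform(-max_scale, max_scale)
    h *= 1.0 + random.uniform(-max_scale, max_scale)
    return (cx, cy, w, h)

def make_track_candidate(prev_box, gt_box_in_fk, iou_fn, min_iou=0.5, max_tries=20):
    """Resample the jitter until the augmented box overlaps the ground truth enough."""
    for _ in range(max_tries):
        candidate = jitter_box(prev_box)
        if gt_box_in_fk is None or iou_fn(candidate, gt_box_in_fk) >= min_iou:
            return candidate
    return prev_box   # fall back to the unaugmented box
```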

We also adopt the track augmentation strategy from TrackFormer [24], where we introduce false negatives by removing from the input some queries associated with tracks that exist in both $f_{k-1}$ and $f_k$. The objective of the system is then to detect the corresponding objects as new objects using object queries. On the other hand, we sample output embeddings (generated from performing object detection on $f_{k-1}$) that map to background bounding boxes. They are included in the track queries as false positives when performing object tracking on $f_k$. To obtain their corresponding patch queries, we take their respective bounding boxes and augment them in the same manner as in track candidate generation, ensuring that the IoU of each augmented bounding box with every ground-truth track in $f_k$ is below 0.5. For each patch-track query generated from the above procedure, our system should decode it to the background class.

Frame pairs are selected from two sources. The first is video data from the MOT benchmarks [26], where we take two frames within a certain temporal range of each other in the same video. This gives us more variety in terms of camera motion. The second is image data from CrowdHuman [33], where we augment a single image through random scaling and translation to obtain a frame pair. For each selected frame pair, we perform the aforementioned steps to generate track candidates and modify the ground truth to account for the false positives/negatives we insert manually. PatchTrack is optimized towards this modified ground truth during training.

4.3 Implementation details

A Kalman filter [43] with a constant velocity model is used to predict track candidates. PatchTrack uses ResNet-50 [15] pre-trained on ImageNet [8] as its CNN backbone and Deformable DETR [56] as the Transformer encoder-decoder framework. The number of object queries is set to 500. Inactive track queries are kept for 30 frames for track re-birth.

We adopt the training procedure from TransTrack [35] as follows. The optimizer is AdamW with $\beta_1=0.9$, $\beta_2=0.999$ and an initial learning rate of $2\mathrm{e}{-4}$. We use 8 NVIDIA Tesla V100 GPUs with a batch size of 16. PatchTrack is first pre-trained on CrowdHuman [33] for 150 epochs, with the learning rate dropped to $2\mathrm{e}{-5}$ after the first 100 epochs. Then, PatchTrack is trained on both CrowdHuman and MOT17 [26] for another 20 epochs. Lastly, it is evaluated on the MOT16 and MOT17 [26] test sets.
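For illustration, a hedged sketch of this optimizer and learning-rate schedule is shown below; `model` is a placeholder standing in for the full PatchTrack network.

```python
# AdamW with lr 2e-4 dropped to 2e-5 after 100 of the 150 pre-training epochs.
import torch

model = torch.nn.Linear(256, 4)   # placeholder module standing in for PatchTrack
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100], gamma=0.1)

for epoch in range(150):          # CrowdHuman pre-training stage
    # ... one epoch of training ...
    scheduler.step()              # lr becomes 2e-5 after epoch 100
```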

4.4 Results

MOT16 We compare PatchTrack with other MOT systems on the MOT16 [26] test set under the private detection protocol (Table 1), where PatchTrack achieves state-of-the-art results in MOTA, ML, and FN. Compared to LMP_p [37] and POI [49], which collectively achieve the best results in the remaining metrics, PatchTrack has significantly lower ML, showing overall better tracking performance. Figure 3 shows additional visual comparisons with LMP_p and POI, where PatchTrack is able to track partially occluded objects and better distinguish objects in crowds without missing objects or tracking one object multiple times.

MOT17 Table 2 shows quantitative results of PatchTrack along with other recent MOT systems on the MOT17 [26] test set under the private detection protocol. Compared to non-Transformer-based methods, PatchTrack reports the best numbers in MT and ML, showing superior ability in trajectory prediction. PatchTrack also performs comparably with other Transformer-based methods, achieving second-best results in most metrics. Compared to TransTrack [35], which has state-of-the-art results in MOTA, MT, ML, and FN, our system produces fewer than half the false positives. We provide additional visualizations of PatchTrack and TransTrack in Figure 4. While PatchTrack performs on par with TransTrack, our system avoids tracking one object multiple times or causing ID switches when a previously fully occluded object re-appears.

4.5 Ablation study

The ablation study is performed on the MOT17 [26] validation set. The original MOT17 training set is split into a new training set and a validation set, consisting of the first half and the second half of each training video respectively. After pre-training PatchTrack on CrowdHuman [33], the system is fine-tuned on both CrowdHuman and the new MOT17 training set and evaluated on the validation set.

Type of queries We evaluate the effect of various queries in Table 3. Removing only patch queries or only track queries means the other type is sent to the Transformer decoder along with object queries. Removing patch-track queries means that the decoder takes in object queries only and essentially behaves like an object detector; after obtaining detections for each frame, we use the Kalman filter [43] and the Hungarian algorithm [16] to associate them. In this case, the modified system falls into the tracking-by-detection paradigm. We see that both patch queries and track queries play an important role in the joint-detection-and-tracking setting. On the other hand, the performance of the tracking-by-detection version of our system is overall comparable with PatchTrack, but it produces more ID switches.

Method MOTA MT ML IDsw
w/o patch queries 71.4 165 42 214
w/o track queries 66.3 141 61 248
w/o patch-track queries 72.0 176 40 200
PatchTrack 72.1 176 40 192
Table 3: Ablation study on the type of query inputs. We send different types of query inputs to our system and evaluate their effects. The results suggest a positive effect of both patch queries and track queries. When the system does not use patch-track queries and behaves as an object detector, with the Kalman filter [43] and the Hungarian algorithm [16] used to associate the predicted detections, it produces more ID switches.

Source of frame patches We also evaluate patch queries generated from different sources. The previous bboxes patches come directly from cropping the current frame of interest using the bounding boxes of tracks in the previous frame. Alternatively, the previous frame patches are generated by cropping the previous frame with the bounding boxes of tracks in the previous frame. From Table 4, we see similar results when using patches from the previous frame compared to using track queries alone, meaning that patches from the previous frame contain similar information to track queries. On the other hand, patches generated from the current frame with the bounding boxes of tracks in the previous frame degrade the performance. We reason that this is because of the misalignment between the frame and the bounding boxes, which leads to less useful information in the patches.

Method MOTA MT ML IDsw
w/o patch query 71.4 165 42 214
previous bboxes 62.8 137 69 258
previous frame 71.4 165 42 214
PatchTrack 72.1 176 40 192
Table 4: Ablation study on source of frame patches. We test patch queries generated from different sources. When the patches come from cropping the current frame using the track bounding boxes from the previous frame (previous bboxes), the corresponding patch queries have a negative effect on the performance.

5 Conclusion

We present PatchTrack, a Transformer-based joint-detection-and-tracking system using frame patches. By generating patch queries from the current frame of interest and track predictions from a motion model, we obtain information about object motion and appearance that is tied to the current frame. This novel way of using visual information from the current frame complements the track queries derived from previous frames. By using both types of queries collectively, PatchTrack achieves competitive results on MOT benchmarks.

References

  • [1] Maryam Babaee, Zimu Li, and Gerhard Rigoll. A dual cnn–rnn for multiple people tracking. Neurocomputing, 368:69–83, 2019.
  • [2] Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing, 2008:1–10, 2008.
  • [3] Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In 2016 IEEE international conference on image processing (ICIP), pages 3464–3468. IEEE, 2016.
  • [4] Zhaowei Cai and Nuno Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6154–6162, 2018.
  • [5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
  • [6] Zhigang Dai, Bolun Cai, Yugeng Lin, and Junying Chen. Up-detr: Unsupervised pre-training for object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1601–1610, 2021.
  • [7] Patrick Dendorfer, Hamid Rezatofighi, Anton Milan, Javen Shi, Daniel Cremers, Ian Reid, Stefan Roth, Konrad Schindler, and Laura Leal-Taixé. Mot20: A benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003, 2020.
  • [8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • [9] Kuan Fang, Yu Xiang, Xiaocheng Li, and Silvio Savarese. Recurrent autoregressive networks for online multi-object tracking. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 466–475. IEEE, 2018.
  • [10] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence, 32(9):1627–1645, 2009.
  • [11] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
  • [12] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
  • [13] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
  • [14] Shoudong Han, Piao Huang, Hongwei Wang, En Yu, Donghaisheng Liu, Xiaofeng Pan, and Jun Zhao. Mat: Motion-aware multi-object tracking. arXiv preprint arXiv:2009.04794, 2020.
  • [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [16] István Kenesei, Robert M Vago, and Anna Fenyvesi. Hungarian. Routledge, 2002.
  • [17] Laura Leal-Taixé, Anton Milan, Ian Reid, Stefan Roth, and Konrad Schindler. Motchallenge 2015: Towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942, 2015.
  • [18] Wei Li, Yuanjun Xiong, Shuo Yang, Mingze Xu, Yongxin Wang, and Wei Xia. Semi-tcl: Semi-supervised track contrastive representation learning. arXiv preprint arXiv:2107.02396, 2021.
  • [19] Chao Liang, Zhipeng Zhang, Yi Lu, Xue Zhou, Bing Li, Xiyong Ye, and Jianxiao Zou. Rethinking the competition between detection and reid in multi-object tracking. arXiv preprint arXiv:2010.12138, 2020.
  • [20] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
  • [21] Xufeng Lin, Chang-Tsun Li, Victor Sanchez, and Carsten Maple. On the detection-to-track association for online multi-object tracking. Pattern Recognition Letters, 146:200–207, 2021.
  • [22] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
  • [23] Nima Mahmoudi, Seyed Mohammad Ahadi, and Mohammad Rahmati. Multi-target tracking using cnn-based features: Cnnmtt. Multimedia Tools and Applications, 78(6):7077–7096, 2019.
  • [24] Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, and Christoph Feichtenhofer. Trackformer: Multi-object tracking with transformers. arXiv preprint arXiv:2101.02702, 2021.
  • [25] Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang. Conditional detr for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3651–3660, 2021.
  • [26] Anton Milan, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler. Mot16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.
  • [27] Bo Pang, Yizhuo Li, Yifan Zhang, Muchen Li, and Cewu Lu. Tubetk: Adopting tubes to track multi-object in a one-step training model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6308–6318, 2020.
  • [28] Jiangmiao Pang, Linlu Qiu, Xia Li, Haofeng Chen, Qi Li, Trevor Darrell, and Fisher Yu. Quasi-dense similarity learning for multiple object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 164–173, 2021.
  • [29] Jinlong Peng, Changan Wang, Fangbin Wan, Yang Wu, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Yanwei Fu. Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In European Conference on Computer Vision, pages 145–161. Springer, 2020.
  • [30] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
  • [31] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28:91–99, 2015.
  • [32] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 658–666, 2019.
  • [33] Shuai Shao, Zijian Zhao, Boxun Li, Tete Xiao, Gang Yu, Xiangyu Zhang, and Jian Sun. Crowdhuman: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123, 2018.
  • [34] Bing Shuai, Andrew Berneshawi, Xinyu Li, Davide Modolo, and Joseph Tighe. Siammot: Siamese multi-object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12372–12382, 2021.
  • [35] Peize Sun, Yi Jiang, Rufeng Zhang, Enze Xie, Jinkun Cao, Xinting Hu, Tao Kong, Zehuan Yuan, Changhu Wang, and Ping Luo. Transtrack: Multiple-object tracking with transformer. arXiv preprint arXiv:2012.15460, 2020.
  • [36] ShiJie Sun, Naveed Akhtar, HuanSheng Song, Ajmal Mian, and Mubarak Shah. Deep affinity network for multiple object tracking. IEEE transactions on pattern analysis and machine intelligence, 43(1):104–119, 2019.
  • [37] Siyu Tang, Mykhaylo Andriluka, Bjoern Andres, and Bernt Schiele. Multiple people tracking by lifted multicut and person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3539–3548, 2017.
  • [38] Pavel Tokmakov, Jie Li, Wolfram Burgard, and Adrien Gaidon. Learning to track with object permanence. arXiv preprint arXiv:2103.14258, 2021.
  • [39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • [40] Balaji Veeramani, John W Raymond, and Pritam Chanda. Deepsort: deep convolutional networks for sorting haploid maize seeds. BMC bioinformatics, 19(9):1–9, 2018.
  • [41] Xingyu Wan, Jinjun Wang, Zhifeng Kong, Qing Zhao, and Shunming Deng. Multi-object tracking using online metric learning with long short-term memory. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 788–792. IEEE, 2018.
  • [42] Yongxin Wang, Kris Kitani, and Xinshuo Weng. Joint object detection and multi-object tracking with graph neural networks. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13708–13715. IEEE, 2021.
  • [43] Greg Welch, Gary Bishop, et al. An introduction to the kalman filter. 1995.
  • [44] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP), pages 3645–3649. IEEE, 2017.
  • [45] Jialian Wu, Jiale Cao, Liangchen Song, Yu Wang, Ming Yang, and Junsong Yuan. Track to detect and segment: An online multi-object tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12352–12361, 2021.
  • [46] Yihong Xu, Yutong Ban, Guillaume Delorme, Chuang Gan, Daniela Rus, and Xavier Alameda-Pineda. Transcenter: Transformers with dense queries for multiple-object tracking. arXiv preprint arXiv:2103.15145, 2021.
  • [47] Fan Yang, Wongun Choi, and Yuanqing Lin. Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2129–2137, 2016.
  • [48] En Yu, Zhuoling Li, Shoudong Han, and Hongwei Wang. Relationtrack: Relation-aware multiple object tracking with decoupled representation. arXiv preprint arXiv:2105.04322, 2021.
  • [49] Fengwei Yu, Wenbo Li, Quanquan Li, Yu Liu, Xiaohua Shi, and Junjie Yan. Poi: Multiple object tracking with high performance detection and appearance feature. In European Conference on Computer Vision, pages 36–42. Springer, 2016.
  • [50] Fangao Zeng, Bin Dong, Tiancai Wang, Cheng Chen, Xiangyu Zhang, and Yichen Wei. Motr: End-to-end multiple-object tracking with transformer. arXiv preprint arXiv:2105.03247, 2021.
  • [51] Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, and Wenyu Liu. Fairmot: On the fairness of detection and re-identification in multiple object tracking. arXiv preprint arXiv:2004.01888, 2020.
  • [52] Linyu Zheng, Ming Tang, Yingying Chen, Guibo Zhu, Jinqiao Wang, and Hanqing Lu. Improving multiple object tracking with single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2453–2462, 2021.
  • [53] Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Tracking objects as points. In European Conference on Computer Vision, pages 474–490. Springer, 2020.
  • [54] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
  • [55] Zongwei Zhou, Junliang Xing, Mengdan Zhang, and Weiming Hu. Online multi-target tracking with tensor-based high-order graph matching. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 1809–1814. IEEE, 2018.
  • [56] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.