
A Bayesian Detect to Track System for Robust Visual Object Tracking and Semi-Supervised Model Learning

Yan Shen, Zhanghexuan Ji, Chunwei Ma, Mingchen Gao
Department of Computer Science and Engineering, University at Buffalo,
The State University of New York, Buffalo, NY, USA
{yshen22,zhanghex,chunweim,mgao8}@buffalo.edu
Abstract

Object tracking is one of the fundamental problems in visual recognition and has seen significant improvements in recent years. These improvements often come at the price of enormous hardware consumption and extensive labeling effort. A missing ingredient for robust tracking is gaining performance with minimal modification to the network structure, and learning from intermittently labeled frames. In our work, we address these problems by modeling the tracking and detection process probabilistically, as multi-object dynamics and per-frame detection uncertainties. Our stochastic model is formulated as a set of neural-network-parameterized distributions. With this formulation, we propose a particle filter-based tracking algorithm for object state estimation. We also present a semi-supervised learning algorithm for intermittently labeled frames based on Variational Sequential Monte Carlo, using the generated particles to estimate a variational bound as the learning objective. In our experiments, we report both mAP and a probability-based detection measurement to compare our algorithm with non-Bayesian baselines; our model outperforms the non-Bayesian baselines on both measurements. We also apply our semi-supervised learning algorithm to the M2Cai16-Tool-Locations dataset, where it outperforms the baseline of learning on labeled frames only.

1 Introduction

Visual object detection and tracking cover a large spectrum of computer vision applications such as video surveillance, motion analysis, action recognition, autonomous driving and medical operation studies. The emergence of deep Convolutional Neural Networks (CNNs) [11] has brought tremendous progress in visual object detection and tracking performance. CNNs are widely utilized for these tasks for two reasons. Firstly, a CNN learns object features that are robust across the full range of variation in the training dataset.

Figure 1: Several challenges exist for a robust tracking and detection system: (a) occlusion, (b) background clutter, (c) motion blur.

Secondly, the shift-covariance property of CNNs allows region proposals to be generated from the areas of maximum response of object-specific detection filters.

A common deep CNN-based video object tracking system consists of two parts: an object detection network and an object displacement network. The R-CNN family [15] is the most commonly used backbone for object detection. Generally, an R-CNN network consists of two stages. In the first stage, a region proposal network (RPN) generates objectness scores and bounding-box offset coordinates from a fixed number of predefined anchors on the output feature maps. In the second stage, candidate regions with high objectness scores are pooled on another feature map to produce refined coordinate offsets and object class scores. For object displacement prediction, a correlation layer combines Siamese-network features from the reference frame and the prediction frame. Several works [10, 6, 22] refine the results with multiple stages of region proposal detection, feature propagation, object tube linking and post-processing.

The main challenges for robust tracking include motion blur, partial occlusion and background clutter, all of which add uncertainty to object state estimation; typical examples are shown in Fig. 1. In response to increasing concerns about model robustness and generalization without introducing extra cost, Bayesian models have seen a resurgence in recent years. Bayesian approaches use a probabilistic treatment of object appearances and states to deal with the uncertainty of model predictions. However, several challenges remain when taking existing works off-the-shelf for learning and inference on network outputs. Firstly, multiple objects appear and disappear across consecutive frames. Secondly, many possible object associations exist when linking objects across frames. Finally, modeling the state uncertainties of a varying number of visible objects from the outputs of a fixed R-CNN structure is still not well solved. In our work, we address the first and second problems by formulating a joint object dynamics over the distributions of a cascaded event of objects appearing/disappearing, new objects arriving and object associations. We address the third problem by treating the R-CNN outputs as clustered emission distributions, over appearance scores, classification scores and location coordinates, conditioned on the objects' ground states.

In this paper, we formulate the problem of multi-object detection and tracking in a fully probabilistic way. Our model is parameterized by tracking and detection neural networks. We take the network structure from the original detect-to-track paper [2] with minimal modification to its notations and definitions. Our formulation consists of a transition model for object dynamics and an emission model for object detection, with the neural network outputs serving as the parameters of the transition and emission distributions. Our probabilistic model handles objects appearing and disappearing by incorporating a prior over object appearance and association. In this model, tracking and detection amount to inferring the posterior of object states from network outputs, and we perform this posterior inference with a particle filter based sampling algorithm: particles are generated from an approximating family of distributions and then re-weighted and re-sampled in accordance with our model. The sampled trajectories take the relations across all visible frames into account. Finally, we present a Variational Sequential Monte Carlo (VSMC) [14] method to learn our model from intermittently labeled frames. Our VSMC method trains on unlabeled consecutive frames by optimizing a tractable evidence lower bound (ELBO), approximated from the sampled labels generated by our particle filter algorithm. Our contributions in this paper are threefold.

  • We give a deep neural network parameterized Bayesian formulation of a detection and tracking system.

  • We propose a particle filter sampling algorithm that makes robust estimates of tracked object states from neural network outputs.

  • We present a semi-supervised training algorithm for our model based on VSMC. Our VSMC algorithm trains on both labeled frames and unlabeled consecutive frames, using pseudo-labels generated by our sampling algorithm.

2 Related Work

A comprehensive review of tracking and detection algorithms is beyond the scope of this paper. We review the works that are most related to our study.

Object Detection and Tracking Currently, state-of-the-art object detection and tracking systems consist of multiple stages of region proposal detection, feature propagation, object tube linking and post-processing. The first stage extracts region proposal candidates. Representative works include 2D R-CNN networks for frame-level box proposals [6], 3D R-CNN networks for video-level tube proposals [10] and feature propagation [22]. In the second stage, detected objects are linked together to make tracking predictions. Either direct detection boxes [20] or tracking-displacement offset boxes [2] are linked by a bipartite matching algorithm [4] or the Viterbi algorithm [3]. In the final stage, suppression methods are used to remove duplicates and false positives.

Correlation Filter and Siamese Network Correlation filters have recently been introduced into visual tracking and shown to achieve high speed as well as robust performance [1, 17]. Siamese networks with triplet loss have also shown advantages in clustering similar object samples and are commonly used in representation learning [8]. Li et al. [13] propose a Siamese-RPN network trained end-to-end to regress tracking template locations. Li et al. [12] improve the performance by using a ResNet-50 backbone and aggregating RPN predictions over multiple layers. Zhang and Peng [21] propose a deeper and wider Siamese network, adding multiple crops to remove padding effects. Instead of cropping the objects on template images, Feichtenhofer et al. [2] use RoI pooling on correlation feature maps to predict object displacement from template to target.

Bayesian Model for Tracking and Detection Bayesian methods have been applied to object detection and tracking for their robustness in handling object state uncertainty. Zhang et al. [19] combine particle filters and correlation filters for robust object tracking in videos. They solve a correlation filter in the dual space by an accelerated proximal gradient method, then use a particle filter tracker to generate particles from the transition model, apply the correlation filter to shift each particle to a stable location, and reweight the samples using the filter responses. However, their correlation filter is based on shallow features in the Fourier domain. Harakeh et al. [7] combine Bayesian inference and CNNs to deal with the uncertainty of deep neural network observations, assuming a Gaussian prior on object locations and a Dirichlet prior on object categories. Weng and Kitani [18] propose a baseline for 3D multi-object tracking: they use a Gaussian distribution to model object location uncertainty and a Kalman filter to infer the trajectory of each object.

3 A Bayesian Formulation of Object Detection and Tracking System

In this section, we give a Bayesian formulation of object detection and tracking. We consider the joint process of object detection and tracking as a hidden Markov model (HMM), where the object states across frames are the hidden states. We view tracking as the transition between neighboring hidden states and detection as the emission from hidden states to noisy visible states observed through the R-FCN network outputs. We show that, in the supervised case, the traditional tracking and detection loss takes a form similar to the maximum-likelihood objective of our model.

A Definition and Notations

We denote the state of the object indexed by $i$ at frame $t$ as $\mathcal{B}_{i,t}$, defined as the tuple

$\mathcal{B}_{i,t}\triangleq\{L_{i,t},\mathcal{C}_{i}\}$   (1)

where $L_{i,t}$ denotes the vector of bounding-box locations and $\mathcal{C}_{i}$ denotes the category of the object.

The category of object $\mathcal{B}_{i,t}$ remains unchanged across all frames in which it appears, while the location $L_{i,t}$ shifts between frames. We model the transition distribution of object $i$ between frames $t$ and $t+1$ as a neural-network-parameterized Gaussian distribution

$p_{g}(L_{i,t+1}|L_{i,t})\sim\mathcal{N}(g^{\phi}_{\mu}(\mathcal{B}_{i,t}),g^{\phi}_{\sigma}(\mathcal{B}_{i,t}))$   (2)

where $g^{\phi}_{\mu}(\mathcal{B}_{i,t})$ and $g^{\phi}_{\sigma}(\mathcal{B}_{i,t})$ are given by the RoI outputs of the deep correlational kernel network of Feichtenhofer et al. [2].
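
For illustration, a minimal sketch of how Eq. (2) can be used to propagate a box between frames, assuming the tracking network returns a per-coordinate mean and variance (variable names are hypothetical):

```python
import numpy as np

def sample_transition(g_mu, g_var, rng=np.random):
    """Draw L_{i,t+1} from the Gaussian transition of Eq. (2).

    g_mu:  (4,) predicted box mean g_mu^phi(B_i,t)
    g_var: (4,) predicted per-coordinate variance g_sigma^phi(B_i,t)
    """
    return rng.normal(loc=g_mu, scale=np.sqrt(g_var))
```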

At each frame $t$, the R-FCN gives $M$ anchored observations $[\hat{\mathcal{B}}_{1,t},\dots,\hat{\mathcal{B}}_{M,t}]$. Each anchored observation is either anchored to a true object $\mathcal{B}_{i,t}$ as defined above or is a clutter observation (false positive). We make the simplifying assumption that each box output corresponds to a unique object in the ground-truth state; this is the assumption most widely adopted in traditional R-FCN training and inference. Under this assumption, we introduce an anchor-to-object variable $\mathbf{u}$ given by

$u_{i}=k\in\{1,2,\dots,K_{t}\}$   (3)

meaning that the R-FCN's anchored observation $\hat{\mathcal{B}}_{i,t}$ is clustered around object $\mathcal{B}_{k,t}$. In our model, the emission distribution from cluster $\mathcal{B}_{k,t}$ to the anchored observations also follows a network-parameterized Gaussian distribution

$p_{f}(\hat{L}_{i,t}|L_{k,t})\sim\mathcal{N}(L_{k,t},f_{\theta}(\mathcal{B}_{k,t}))$   (4)

To distinguish between ground-truth and clutter observations, we use another binary variable $\mathcal{E}_{i,t}$ given by

$\mathcal{E}_{i,t}=\begin{cases}0&\quad\mathcal{B}_{k,t}\text{ is associated with one real object}\\ 1&\quad\mathcal{B}_{k,t}\text{ is a clutter observation}\end{cases}$   (5)

B Tracking as Object Dynamics and Association

We assume that $K_{t}$ objects appear at frame $t$. At the next frame, each object remains or disappears independently with death probability $\lambda_{D}$, leaving $\hat{K}_{t}$ remaining objects, while $\Delta K_{t}$ new targets arrive at rate $\lambda_{L}$ following a Poisson distribution. The whole object appearance dynamics can be written as

$p(\hat{K}_{t},\Delta K_{t})={\lambda_{D}}^{K_{t}-\hat{K}_{t}}{(1-\lambda_{D})}^{\hat{K}_{t}}\dfrac{\lambda_{L}^{\Delta K_{t}}e^{-\lambda_{L}}}{\Delta K_{t}!}$   (6)

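As a minimal sketch, the appearance prior of Eq. (6) can be evaluated in log space as follows; the death probability and birth rate values are hypothetical hyperparameters:

```python
import math

def log_appearance_prior(k_prev, k_survived, k_new, lambda_d=0.1, lambda_l=0.5):
    """Log of Eq. (6): Bernoulli survival of k_prev objects plus Poisson arrivals.

    k_prev:     number of objects at frame t (K_t)
    k_survived: number of those that remain at the next frame (K_hat_t)
    k_new:      number of newly arrived objects (Delta K_t)
    """
    log_survival = ((k_prev - k_survived) * math.log(lambda_d)
                    + k_survived * math.log(1.0 - lambda_d))
    log_birth = (k_new * math.log(lambda_l) - lambda_l
                 - math.lgamma(k_new + 1))  # log(lambda^k e^{-lambda} / k!)
    return log_survival + log_birth
```
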
Considering that all objects are observed in an unknown order, we formulate the distribution over a reordering of the $\hat{K}_{t+1}+\Delta K_{t}$ objects by introducing a measurement-to-target association (M$\to$T) hypothesis $\lambda=(\mathbf{r},\hat{K}_{t+1},\hat{K}_{t+1}+\Delta K_{t})$. The elements of the association vector $\mathbf{r}=(r_{1},r_{2},r_{3},\dots,r_{\hat{K}_{t+1}})$ are given by

$r_{j}=k\in\{1,2,3,\dots,\hat{K}_{t+1}+\Delta K_{t}\}$   (7)

meaning that the $j$-th object at frame $t$ is associated with the $k$-th object at frame $t+1$. We also assign a uniform prior to the association vector $\mathbf{r}$

$p(\mathbf{r})=\dbinom{\hat{K}_{t}+\Delta K_{t}}{\hat{K}_{t}}^{-1}$   (8)

A newly appeared object $\mathcal{B}_{j,t+1}$ follows a Gaussian prior on its location $L_{j,t+1}$ and a uniform prior on its category $\mathcal{C}_{j}$

$p_{0}(L_{j,t+1})\sim\mathcal{N}(\mu_{0},\Sigma_{0}), \qquad p_{0}(\mathcal{C}_{j})\sim F(k,0,K-1)$   (9)

For an existing object $\mathcal{B}_{j,t+1}$, its location $L_{j,t+1}$ is updated by the transition distribution defined in Eq. (2), parameterized by the tracking network, and its category $\mathcal{C}_{j}$ remains the same as in its previous state.

With the above definitions, the tracking process is modeled as the joint transition probability of object dynamics and associations

$\begin{aligned} &p(\{\mathcal{B}_{1,t+1},\dots,\mathcal{B}_{\hat{K}_{t+1}+\Delta K_{t},t+1}\},\mathbf{r}\,|\,\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{K_{t},t}\}) \\ &= p(\mathbf{r})\,p(\hat{K}_{t},\Delta K_{t}) \prod_{i=1}^{\hat{K}_{t}}p_{g}(L_{r_{i},t+1}|L_{i,t})\,I(\mathcal{C}_{r_{i},t+1}|\mathcal{C}_{i,t}) \prod_{i\notin\mathbf{r}}p_{0}(\mathcal{B}_{i})\,p_{0}(\mathcal{C}_{i}) \end{aligned}$

where $I(\cdot)$ is an indicator function with $I(\mathcal{C}_{r_{i},t+1}|\mathcal{C}_{i,t})=1$ if $\mathcal{C}_{r_{i},t+1}=\mathcal{C}_{i,t}$ and $0$ otherwise. The whole object dynamics is illustrated in Fig. 2.

Figure 2: The whole object dynamics in our model. Multiple objects appear, disappear and are observed in an unknown order.

C Detection as Object Emission

We consider the emission distribution from a real object $\mathcal{B}_{i,t}$ to its anchored observations $\hat{\mathcal{B}}_{k,t}$ in the R-FCN outputs. Recall that the R-FCN's emissions from the final layer of feature maps consist of three parts: the object appearance score $e_{k,t}$, the object classification score $\mathcal{K}_{k,t}$ and the object location coordinates $L_{k,t}$. Instead of viewing the probability output directly as a distribution over the object's existence and category, we treat the R-FCN's categorical score output as a draw from a distribution conditioned on the real category $\mathcal{C}_{i}$ of the associated object. In our formulation, $\mathcal{K}_{k,t}$ follows a Dirichlet distribution with concentration $\alpha$ conditioned on $\mathcal{C}_{i}$

$p(\hat{\mathcal{K}}_{i,t}|\mathcal{C}_{i})\sim\textit{Dir}(\alpha_{1},\dots,\alpha_{K})$   (10)

where, in our formulation, $\alpha_{1},\dots,\alpha_{K}$ are set as

$\alpha_{k}=\begin{cases}\alpha+1&\quad\text{if }k=\mathcal{C}_{i}\\ 1&\quad\text{if }k\neq\mathcal{C}_{i}\end{cases}$   (11)

Similarly, we treat the object appearance score $e_{k,t}$ as following a Beta distribution conditioned on $\mathcal{E}_{i,t}$, which indicates whether $\hat{\mathcal{B}}_{k,t}$ is anchored around a real object or a clutter observation:

$p(\hat{e}_{j,t}|\mathcal{E}_{i,t})\sim\textit{Beta}(\alpha_{0},\alpha_{1})$   (12)

where $\alpha_{0},\alpha_{1}$ are set as

$\alpha_{0},\alpha_{1}=\begin{cases}\alpha+1,\,1&\quad\text{if }\mathcal{E}_{i,t}=0\\ 1,\,\alpha+1&\quad\text{if }\mathcal{E}_{i,t}=1\end{cases}$   (13)

so that when $\hat{\mathcal{B}}_{k,t}$ is anchored around a real object $\mathcal{B}_{i,t}$,

$p(\hat{e}_{j,t}|\mathcal{E}_{i,t})\sim\textit{Beta}(\alpha+1,1)$   (14)

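As a minimal sketch, the emission densities of Eqs. (10)-(14) can be evaluated with standard distributions; the concentration value `alpha` and the example scores below are hypothetical:

```python
import numpy as np
from scipy.stats import dirichlet, beta

def class_emission_logpdf(class_scores, true_class, alpha=4.0):
    """Dirichlet log-density of a K-dimensional classification score vector,
    with concentration alpha+1 on the true class and 1 elsewhere (Eqs. 10-11)."""
    conc = np.ones_like(class_scores)
    conc[true_class] += alpha
    return dirichlet.logpdf(class_scores, conc)

def appearance_emission_logpdf(e_score, is_real, alpha=4.0):
    """Beta log-density of the appearance score: Beta(alpha+1, 1) for a real
    object and Beta(1, alpha+1) for a clutter observation (Eqs. 12-13)."""
    a, b = (alpha + 1.0, 1.0) if is_real else (1.0, alpha + 1.0)
    return beta.logpdf(e_score, a, b)

# Example: a 5-class score vector concentrated on class 2, and a high appearance score.
scores = np.array([0.05, 0.05, 0.8, 0.05, 0.05])
print(class_emission_logpdf(scores, true_class=2))
print(appearance_emission_logpdf(0.9, is_real=True))
```
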
For the object location coordinate outputs $\hat{L}_{k,t}$, we consider $\hat{L}_{k,t}$ a noisy observation of the object's real location $L_{i,t}$: it follows the Gaussian distribution of Eq. (4), with mean $L_{i,t}$ and variance $f_{\theta}(\mathcal{B}_{k,t})$. In our work, we do not treat $f_{\theta}(\mathcal{B}_{k,t})$ as a direct output of the R-FCN network. Instead, we follow [7] in viewing the covariance as the combined contribution of model uncertainty and prediction uncertainty. As in their work, we omit the model uncertainty and use the prediction uncertainty $\Sigma(\mathcal{B}_{i,t})$ as an approximation of $f_{\theta}(\mathcal{B}_{k,t})$. The prediction uncertainty $\Sigma(\mathcal{B}_{i,t})$ is taken as the covariance of the coordinate predictions $\hat{L}_{k,t}$ over all anchors clustered around $\mathcal{B}_{k,t}$.

$\mu(\mathcal{B}_{i,t})=\dfrac{1}{M_{i}}\sum_{u_{j}=i}f^{\theta}(\hat{\mathcal{B}}_{j,t})$   (15)
$\Sigma(\mathcal{B}_{i,t})\approx\dfrac{\alpha}{M_{i}}\Big(\sum_{u_{j}=i}f^{\theta}(\hat{\mathcal{B}}_{j,t})f^{\theta}(\hat{\mathcal{B}}_{j,t})^{T}\Big)-\alpha\,\mu(\mathcal{B}_{i,t})\mu(\mathcal{B}_{i,t})^{T}$   (16)

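A minimal sketch of the cluster statistics in Eqs. (15)-(16), computed from the box predictions of all anchors assigned to one object; the array shapes and the scaling factor `alpha` are assumptions:

```python
import numpy as np

def cluster_statistics(anchor_boxes, alpha=1.0):
    """Mean and scaled covariance of the box predictions of all anchors
    assigned to one object cluster (Eqs. 15-16).

    anchor_boxes: array of shape (M_i, 4) with the predicted box coordinates
                  f_theta(B_hat_{j,t}) of the M_i anchors with u_j = i.
    """
    m_i = anchor_boxes.shape[0]
    mu = anchor_boxes.mean(axis=0)                        # Eq. (15)
    second_moment = anchor_boxes.T @ anchor_boxes / m_i   # (1/M_i) sum f f^T
    sigma = alpha * (second_moment - np.outer(mu, mu))    # Eq. (16)
    return mu, sigma
```
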
For an anchored observation $\hat{\mathcal{B}}_{k,t}$ associated with a clutter detection $\hat{\mathcal{B}}^{0}_{i,t}$, we assume a Gaussian prior on the clutter location and a uniform prior on the clutter category, in the same form as Eq. (9).

With the above definitions, the joint emission probability is given by

$\begin{aligned} &p(\{\hat{\mathcal{B}}_{1,t},\dots,\hat{\mathcal{B}}_{M,t}\}\,|\,\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{K_{t},t}\}) \\ &=\Gamma\prod_{i\in(\hat{E}_{i,t}=1)}\hat{e}_{j,t}^{\alpha}\,\mathcal{N}(f^{\theta}(\hat{L}_{i,t});L_{u_{i},t},f_{\theta}(\mathcal{B}_{k,t}))\,c_{i}^{\alpha} \\ &\quad\;\sum_{L_{u_{i},t}}\sum_{C_{u_{i}}}\prod_{i\in(\hat{E}_{i,t}=0)}(1-\hat{e}_{j,t})^{\alpha}\,\mathcal{N}(f^{\theta}(\hat{L}_{i,t});L_{u_{i},t},f_{\theta}(\mathcal{B}_{k,t}))\,c_{i}^{\alpha}\,p_{0}(L_{j,t})\,p_{0}(C_{u_{i}}) \end{aligned}$

where $c_{i}=\hat{\mathcal{K}}_{i,t}(\mathcal{C}_{u_{i}})$ is the categorical score for the associated object's class. Because we do not infer the location and category of negative anchors, we take the marginalized distribution over the categorical and location output scores for negative anchors. The whole R-FCN detection process is shown in Fig. 3.

Figure 3: Our Bayesian view of the traditional tracking and detection network outputs.

D Traditional D&T Loss as Model Likelihood

In the case of supervised learning, we have annotations of all hidden states, all visible states and distributions are parameterized by the network outputs, and all association variables are also annotated for tracking. We train the network by taking the log-likelihood of the joint probability as

$\begin{aligned} \mathcal{L} &=\sum_{t=1}^{T}\log p(\{\hat{\mathcal{B}}_{1,t},\dots,\hat{\mathcal{B}}_{M,t}\}|\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{K_{t},t}\}) \\ &\quad+\sum_{t=1}^{T-1}\log p(\{\mathcal{B}_{1,t+1},\dots,\mathcal{B}_{\hat{K}_{t+1}+\Delta K_{t},t+1}\},\mathbf{r}_{t}|\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{K_{t},t}\}) \\ &=\sum_{i\in\{r_{i}\in\{1:K_{t+1}\}\},t}\tfrac{1}{2}\|L_{r_{i},t+1}-g^{\phi}_{\mu}(\mathcal{B}_{i,t})\|_{g^{\phi}_{\sigma}(\mathcal{B}_{i,t})^{-1}}+\tfrac{1}{2}\log\big(g^{\phi}_{\sigma}(\mathcal{B}_{i,t})\big) \\ &\quad+\sum_{i=1,t}^{i\leq M}\tfrac{1}{2}\|\hat{L}_{i,t}-L_{u_{i},t}\|_{\Sigma^{-1}}+\tfrac{1}{2}\log\det\big(\Sigma^{-1}(\mathcal{B}_{i,t})\big)+H(e_{u_{i}},\mathcal{E}_{u_{i}})+H(\mathcal{K}_{i,t},\mathcal{C}_{u_{i}})+\text{const} \\ &=\mathcal{L}_{track}+\mathcal{L}_{detect}+\text{const} \end{aligned}$   (17)

where $\mathcal{L}_{track}$ and $\mathcal{L}_{detect}$ are the commonly used training losses for the deep correlational kernel tracking network (or Siamese network) and the R-FCN detection network. The only difference is that we introduce $g^{\phi}_{\sigma}(\mathcal{B}_{i,t})$ as a parameterized network output giving the independent variance of the tracking coordinate prediction, and include the term $\frac{1}{2}\|L_{r_{i},t+1}-g^{\phi}_{\mu}(\mathcal{B}_{i,t})\|_{g^{\phi}_{\sigma}(\mathcal{B}_{i,t})^{-1}}+\frac{1}{2}\log(g^{\phi}_{\sigma}(\mathcal{B}_{i,t}))$ for jointly training the tracking mean and variance. We use the traditional detection loss for training the detection network and omit the term $\frac{1}{2}\log\det(\Sigma^{-1}(\mathcal{B}_{i,t}))$ during training. We also omit the marginal distributions of location and category for clutter anchors.
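
As an illustration, a minimal PyTorch-style sketch of the tracking term of Eq. (17), which jointly trains the predicted offset mean and per-coordinate variance; the tensor names and shapes are assumptions:

```python
import torch

def tracking_nll_loss(pred_mu, pred_log_sigma, target_box):
    """Gaussian negative log-likelihood term of Eq. (17):
    0.5 * ||L - g_mu||^2 / g_sigma + 0.5 * log g_sigma, summed over coordinates.

    pred_mu:        (N, 4) predicted box offsets g_mu^phi(B_i,t)
    pred_log_sigma: (N, 4) predicted log-variances log g_sigma^phi(B_i,t)
    target_box:     (N, 4) ground-truth boxes L_{r_i, t+1}
    """
    inv_sigma = torch.exp(-pred_log_sigma)
    nll = 0.5 * (target_box - pred_mu).pow(2) * inv_sigma + 0.5 * pred_log_sigma
    return nll.sum(dim=-1).mean()
```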

4 A Particle Filter Object State Estimation Algorithm For Robust Tracking

With the formulation in the previous section, robust tracking amounts to inferring the posterior over all object states and their associations across frames, given the R-FCN observations and the dynamics from the tracking network. While our probabilistic formulation makes it straightforward to evaluate the joint probability, sampling from the posterior is less straightforward: directly inferring the joint states of objects and their associations is intractable, and an analytical solution would require traversing all possible object appearance and association configurations, which is impractical to implement. We therefore give a particle filter based solution that samples from an approximating family of tractable proposal distributions.

Initial Sample on R-FCN Detectors    Assume that the R-FCN outputs $M$ anchored bounding-box predictions $\{\hat{\mathcal{B}}_{1,t},\dots,\hat{\mathcal{B}}_{M,t}\}$ at frame $t$. We sample $\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{K_{t},t}\}$ according to the posterior

$r_{1}(\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{K_{t},t}\})\propto p(\{\hat{\mathcal{B}}_{1,t},\dots,\hat{\mathcal{B}}_{M,t}\}|\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{K_{t},t}\})$

We sample from $r_{1}(\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{K_{t},t}\})$ as follows (the proof is given later in this section). First, the per-anchor outputs of the neural network are clustered, similarly to NMS. Greedy clustering is performed on the output category scores: the anchor with the highest non-background score is chosen as the cluster center, any anchor with an intersection over union (IoU) greater than 0.5 is added to the cluster, and the cluster members are removed from the remaining anchor set. This process is repeated until all anchors are assigned to clusters. Treating each cluster as an object candidate yields the candidate objects $\{\tilde{\mathcal{B}}_{1,t},\dots,\tilde{\mathcal{B}}_{K_{t},t}\}$. We then sample the objects and their states with the following steps (a code sketch of the whole procedure is given after the list).

  1. Step 1: Include object $\tilde{\mathcal{B}}_{i,t}$ with probability

     $p_{i}=\prod_{j\in(u_{j}=i)}(\hat{e}_{j})^{\alpha}\Big/\Big(\prod_{j\in(u_{j}=i)}(\hat{e}_{j})^{\alpha}+\prod_{j\in(u_{j}=i)}(1-\hat{e}_{j})^{\alpha}\Big)$

  2. Step 2: Assign the initial box state $\mu(\mathcal{B}_{i,t})$ by Eq. (15).

  3. Step 3: Assign a new tracking id to each $\mathcal{B}_{i,t}$.

  4. Step 4: Sample the object class $\mathcal{C}_{i}\sim\text{Categorical}(c_{i})$, where

     $c_{i}=\prod_{j\in(u_{j}=i)}\mathcal{K}_{j,k}^{\alpha}\Big/\sum_{k^{\prime}}\prod_{j\in(u_{j}=i)}\mathcal{K}_{j,k^{\prime}}^{\alpha}$

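The following sketch illustrates the greedy clustering and the per-cluster sampling of Steps 1-4; the box format, score arrays and IoU helper are assumptions rather than the authors' implementation:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def greedy_cluster(boxes, obj_scores, iou_thr=0.5):
    """Cluster anchors around local maxima of the non-background score (NMS-like)."""
    remaining = list(np.argsort(-obj_scores))
    clusters = []
    while remaining:
        center = remaining.pop(0)
        members = [center] + [j for j in remaining if iou(boxes[center], boxes[j]) > iou_thr]
        remaining = [j for j in remaining if j not in members]
        clusters.append(members)
    return clusters

def sample_initial_objects(boxes, obj_scores, class_scores, alpha=1.0, rng=np.random):
    """Steps 1-4: include each cluster with probability p_i, set its box to the
    cluster mean (Eq. 15), and sample its class from the product of anchor scores."""
    objects = []
    for members in greedy_cluster(boxes, obj_scores):
        e = obj_scores[members]
        log_pos = alpha * np.sum(np.log(e + 1e-9))
        log_neg = alpha * np.sum(np.log(1.0 - e + 1e-9))
        p_i = 1.0 / (1.0 + np.exp(log_neg - log_pos))        # Step 1
        if rng.random() < p_i:
            mu = boxes[members].mean(axis=0)                  # Step 2
            log_c = alpha * np.sum(np.log(class_scores[members] + 1e-9), axis=0)
            c = np.exp(log_c - log_c.max()); c /= c.sum()     # Step 4
            cls = rng.choice(len(c), p=c)
            objects.append({"box": mu, "class": int(cls)})    # Step 3: new id = list index
    return objects
```
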
Data Association    After including the sampled objects and their states, we associate the generated initial objects $\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{\tilde{K}_{t},t}\}^{l}$ with the objects in their ancestor particle $\{\mathcal{B}_{1,t-1},\dots,\mathcal{B}_{K_{t-1},t-1}\}^{a_{t-1}^{l}}$. To match these two sets of objects, we first construct an affinity matrix $J$ of dimension $\tilde{K}_{t}\times K_{t-1}$, assigning to its elements the sum of the log IoU between objects and their log class scores

$J_{i,j}=\log IoU(g_{\phi}^{\mu}(\mathcal{B}_{i,t-1}^{a_{t-1}^{l}}),\mu(\mathcal{B}_{j,t}))+\log\mathcal{K}_{j,\mathcal{C}_{i}}$   (18)

We then apply the Hungarian algorithm to solve the bipartite matching problem in polynomial time. In addition, we reject a match between $\mathcal{B}_{i,t-1}$ and $\mathcal{B}_{j,t}$ when their IoU is below a given threshold $IoU_{min}$. Without loss of generality, we assume the output is a set of matched objects $\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{\hat{K}_{t},t}\}^{l}$ in frame $t$ with their ancestors $\{\mathcal{B}_{r_{1},t-1},\dots,\mathcal{B}_{r_{\hat{K}_{t}},t-1}\}^{a_{t-1}^{l}}$ in frame $t-1$, along with the unmatched objects $\{\mathcal{B}_{\hat{K}_{t}+1,t},\dots,\mathcal{B}_{\tilde{K}_{t},t}\}^{l}$.
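
A minimal sketch of this matching step using `scipy.optimize.linear_sum_assignment`; the affinity follows Eq. (18), while the `iou_fn` helper, box format and threshold value are assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(prev_boxes, prev_classes, cur_boxes, cur_class_scores,
              iou_fn, iou_min=0.3):
    """Match current objects to ancestor objects via the affinity of Eq. (18).

    prev_boxes:       (K_{t-1}, 4) offset-predicted boxes from the ancestor particle
    prev_classes:     (K_{t-1},)   class index of each ancestor object
    cur_boxes:        (K_t, 4)     sampled initial boxes mu(B_{j,t}) at frame t
    cur_class_scores: (K_t, C)     class scores K_j of each current object
    Returns a list of (ancestor index, current index) pairs.
    """
    k_prev, k_cur = len(prev_boxes), len(cur_boxes)
    iou_mat = np.array([[iou_fn(prev_boxes[i], cur_boxes[j])
                         for j in range(k_cur)] for i in range(k_prev)])
    affinity = (np.log(iou_mat + 1e-9)
                + np.log(cur_class_scores[:, prev_classes].T + 1e-9))  # Eq. (18)
    rows, cols = linear_sum_assignment(-affinity)   # Hungarian algorithm, maximize affinity
    return [(i, j) for i, j in zip(rows, cols) if iou_mat[i, j] >= iou_min]
```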

Independent Proposal Update for Tracking Objects    After associating the sampled objects with their ancestors, we update the state of each sampled object independently by sampling from its posterior conditioned on its ancestor. With the transition probability of matched objects as the prior, we sample each object's bounding-box state from the corresponding posterior. We use this proposal distribution to minimise the variance of the importance weights

$\begin{aligned} r_{2}(\mathcal{B}_{i,t}) &\propto p(\mathcal{B}_{i,t}|\mathcal{B}_{r_{i}^{-1},t-1}^{a_{t-1}^{l}})\prod_{j\in u_{j}=i}p(\hat{\mathcal{B}}_{j,t}|\mathcal{B}_{i,t}) \\ &=\mathcal{N}(\mu^{\prime}(\mathcal{B}_{i,t}),\Sigma^{\prime}(\mathcal{B}_{i,t})) \end{aligned}$   (19)

where the sufficient statistics of the proposal in Eq. (19) can be computed in closed form as

$\Sigma^{\prime}(\mathcal{B}_{i,t})=\big(\mathrm{diag}^{-1}(g_{\phi}^{\sigma}(\mathcal{B}^{a_{t-1}^{l}}_{r_{i}^{-1},t-1}))+M_{i}\Sigma^{-1}(\mathcal{B}_{i,t})\big)^{-1}$   (20)

$\mu^{\prime}(\mathcal{B}_{i,t})=\Sigma^{\prime}\big((g_{\phi}^{\sigma}(\mathcal{B}^{a_{t-1}^{l}}_{r_{i}^{-1},t-1}))^{-1}\otimes g_{\phi}^{\mu}(\mathcal{B}^{a_{t-1}^{l}}_{r_{i}^{-1},t-1})+M_{i}\Sigma^{-1}(\mathcal{B}_{i,t})\mu(\mathcal{B}_{i,t})\big)$   (21)

where $M_{i}$ is the number of anchored observations of object $\mathcal{B}_{i,t}$ in the R-FCN.
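
A minimal sketch of the closed-form fusion in Eqs. (20)-(21), combining the tracking-network transition prior (diagonal variance) with the cluster observation statistics; variable names are assumptions:

```python
import numpy as np

def proposal_update(track_mu, track_var, obs_mu, obs_cov, m_i):
    """Posterior mean and covariance of a matched object's box (Eqs. 20-21).

    track_mu:  (4,)   transition mean g_mu from the tracking network
    track_var: (4,)   per-coordinate transition variance g_sigma (diagonal)
    obs_mu:    (4,)   cluster mean mu(B_i,t) from Eq. (15)
    obs_cov:   (4, 4) cluster covariance Sigma(B_i,t) from Eq. (16)
    m_i:       number of anchors assigned to the object
    """
    prior_prec = np.diag(1.0 / track_var)              # diag^{-1}(g_sigma)
    obs_prec = m_i * np.linalg.inv(obs_cov)            # M_i * Sigma^{-1}(B_i,t)
    post_cov = np.linalg.inv(prior_prec + obs_prec)    # Eq. (20)
    post_mu = post_cov @ (prior_prec @ track_mu + obs_prec @ obs_mu)  # Eq. (21)
    return post_mu, post_cov
```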

Similarly, for an unmatched object $i^{\prime}$, $\mu^{\prime}_{0}(\mathcal{B}_{i^{\prime},t})$ and $\Sigma^{\prime}_{0}(\mathcal{B}_{i^{\prime},t})$ are given by

$\Sigma_{0}^{\prime}(\mathcal{B}_{i^{\prime},t})=\big(\Sigma^{-1}_{0}+M_{i^{\prime}}\Sigma^{-1}(\mathcal{B}_{i^{\prime},t})\big)^{-1}$   (22)

$\mu^{\prime}_{0}(\mathcal{B}_{i^{\prime},t})=\Sigma_{0}^{\prime}\big(\Sigma^{-1}_{0}\mu_{0}+M_{i^{\prime}}\Sigma^{-1}(\mathcal{B}_{i^{\prime},t})\mu(\mathcal{B}_{i^{\prime},t})\big)$   (23)

Finally, we keep the sampled class $\mathcal{C}_{i^{\prime}}$ for each unmatched object $i^{\prime}$. For a matched object $i$, we discard the sampled class and assign $\mathcal{C}_{i}$ the class of its matched ancestor object, $\mathcal{C}_{r^{-1}_{i}}^{a_{t-1}^{l}}$. Similarly, we keep the tracking id of unmatched objects and assign each matched object $\mathcal{B}_{i,t}$ the id $id(\mathcal{B}_{r_{i}^{-1},t-1}^{a_{t-1}^{l}})$ of its ancestor.

Theorem 1

The importance weight $w_{t}^{l}$ of the sampled object state $\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{\tilde{K}_{t},t}\}^{l}$ given its ancestor state $\{\mathcal{B}_{1,t-1},\dots,\mathcal{B}_{K_{t-1},t-1}\}^{a_{t-1}^{l}}$ is

$w_{t}^{l}\propto \underbrace{(1-\lambda_{D})^{K_{t-1}-\hat{K}_{t}}\lambda_{D}^{\hat{K}_{t}}}_{\text{Object Appearance}}\;\underbrace{p((\tilde{K}_{t}-\hat{K}_{t});\lambda_{L})}_{\text{New Object Arriving}}\;\underbrace{\dbinom{\tilde{K}_{t}}{\hat{K}_{t}}^{-1}}_{\text{Association Prior}}\; e^{-\frac{1}{2}R_{D}}\,\tau_{D}\,\underbrace{\prod_{i\in\mathbf{r}}\mathcal{K}_{i,\mathcal{C}_{r^{-1}_{i}}^{a_{t-1}^{l}}}}_{\text{Tracking Class Probability}}$   (24)

where $R_{D}$ is given by

$\begin{aligned} R_{D} &=\sum_{i\in\mathbf{r}}\underbrace{g_{\phi}^{\mu}(\mathcal{B}^{a_{t-1}^{l}}_{r_{i}^{-1},t-1})^{\circ 2}\odot(g_{\phi}^{\sigma}(\mathcal{B}^{a_{t-1}^{l}}_{r_{i}^{-1},t-1}))^{-1}}_{\text{Tracking Transition Prior}} -\underbrace{\mu_{0}^{T}\Sigma^{-1}_{0}\mu_{0}}_{\text{New Object Prior}} -\underbrace{\mu^{\prime}(\mathcal{B}_{i,t})^{T}\Sigma^{\prime}(\mathcal{B}_{i,t})^{-1}\mu^{\prime}(\mathcal{B}_{i,t})}_{\text{Matched Location Posterior}} \\ &\quad+\underbrace{\mu^{\prime}_{0}(\mathcal{B}_{i,t})^{T}\Sigma^{\prime}_{0}(\mathcal{B}_{i,t})^{-1}\mu^{\prime}_{0}(\mathcal{B}_{i,t})}_{\text{Unmatched Location Posterior}} \end{aligned}$   (25)-(26)

and $\tau_{D}$ is given by

$\tau_{D}=\prod_{i\in\mathbf{r}}|g_{\phi}^{\sigma}(\mathcal{B}^{a_{t-1}^{l}}_{r_{i}^{-1},t-1})|^{-1/2}_{\otimes}\det(\Sigma_{0})^{1/2}\det(\Sigma^{\prime}(\mathcal{B}_{i,t})^{-1})^{1/2}\det(\Sigma^{\prime}_{0}(\mathcal{B}_{i,t})^{-1})^{-1/2}$   (27)

5 VSMC Algorithm For Semi-Supervised Model Learning

In this section, we describe our model training algorithm in the semi-supervised setting. In this setting, we only have labeled bounding boxes at frame $t$; in the neighboring frames $\{t-T:t-1\}$ and $\{t+1:t+T\}$, only the visual input is given, without labeled bounding boxes. Unlike the supervised learning algorithm of Section 3.D, training on the joint distribution of the unlabeled frames is not feasible. Instead, our objective is the log-likelihood of the marginal distribution in which the unseen annotations are marginalized out.

$\begin{aligned} \mathcal{L} &=\log p(\underbrace{\{\hat{\mathcal{B}}_{1:M,t-T}\},\dots,\{\hat{\mathcal{B}}_{1:M,t}\},\dots,\{\hat{\mathcal{B}}_{1:M,t+T}\}}_{\text{Network Predictions from Data}},\underbrace{\{\mathcal{B}_{1:K_{t},t}\}}_{\text{Labels for frame }t}) \\ &=\underbrace{\log p(\{\hat{\mathcal{B}}_{1:M,t}\},\{\mathcal{B}_{1:K_{t},t}\})}_{\mathcal{L}_{detect}\text{ at frame }t} +\underbrace{\log p(\{\hat{\mathcal{B}}_{1:M,t+1}\},\dots,\{\hat{\mathcal{B}}_{1:M,t+T}\}|\{\mathcal{B}_{1:K_{t},t}\})}_{\text{Forward Marginal Likelihood}} \\ &\quad+\underbrace{\log p(\{\hat{\mathcal{B}}_{1:M,t-T}\},\dots,\{\hat{\mathcal{B}}_{1:M,t-1}\}|\{\mathcal{B}_{1:K_{t},t}\})}_{\text{Backward Marginal Likelihood}} \\ &=\sum_{t^{\prime}=-T,t^{\prime}\neq 0}^{T}\log p(\{\hat{\mathcal{B}}_{1:M,t+t^{\prime}}\}|\{\mathcal{B}_{1:K_{t+t^{\prime}},t+t^{\prime}}\}) +\log p(\{\mathcal{B}_{1:K_{t+t^{\prime}},t+t^{\prime}}\}|\{\mathcal{B}_{1:K_{t+t^{\prime}-1},t+t^{\prime}-1}\}) \\ &\quad+\log p(\{\hat{\mathcal{B}}_{1:M,t}\},\{\mathcal{B}_{1:K_{t},t}\}) \end{aligned}$   (28)

Eq. (28) follows from the properties of the HMM. The total objective decomposes into three terms. The first term $\log p(\{\hat{\mathcal{B}}_{1:M,t}\},\{\mathcal{B}_{1:K_{t},t}\})$ is the supervised detection loss, in the same form as the traditional R-FCN detection loss. The second term is the marginal likelihood of the forward predictions on frames $\{t+1:t+T\}$, and the third term is the marginal likelihood of the backward predictions on frames $\{t-T:t-1\}$. Both marginal likelihoods are conditioned on frame $t$'s annotation $\{\mathcal{B}_{1:K_{t},t}\}$, which acts as a weakly supervised training signal for the unlabeled neighboring frames. Since the backward likelihood factorizes with the same structure as the forward likelihood, we only derive the variational bound for the forward likelihood in the remainder of this section. Because training the marginal-likelihood terms directly is intractable, we use the following surrogate ELBO as a substitute

$\tilde{\mathcal{L}}=\sum_{t^{\prime}=1}^{T}\mathbb{E}_{r(\mathcal{B}_{t-1:t-t^{\prime}}^{1:N},a_{1:t^{\prime}-1}^{1:N};\lambda)}\Big[\log\Big(\frac{1}{N}\sum_{l=1}^{N}w_{t+t^{\prime}}^{l}\Big)\Big]$   (29)
Figure 4: Our semi-supervised learning algorithm takes the sum of the supervised loss on the labeled frame and the semi-supervised loss on consecutive frames. The semi-supervised loss is a weighted sum across all sampled particles, weighted by their importance weights. The loss term for each sampled particle consists of the detection loss on its own frame and the tracking loss from the previous frame on its sampled trajectory.

To make the surrogate ELBO loss for semi-supervised learning of our tracking and detection model easier to interpret, we relate it to the widely adopted tracking and detection losses. Taking the gradient, the surrogate loss can be rewritten as

$\begin{aligned} \nabla\tilde{\mathcal{L}} &=\sum_{t^{\prime}=1}^{T}\mathbb{E}_{r(\mathcal{B}_{t-1:t-t^{\prime}}^{1:N},a_{1:t^{\prime}-1}^{1:N};\lambda)}\nabla\log\Big(\frac{1}{N}\sum_{l=1}^{N}w_{t+t^{\prime}}^{l}\Big) \\ &=\sum_{t^{\prime}=1}^{T}\mathbb{E}_{r(\mathcal{B}_{t-1:t-t^{\prime}}^{1:N},a_{1:t^{\prime}-1}^{1:N};\lambda)}\sum_{l=1}^{N}\hat{w}_{t+t^{\prime}}^{l}\nabla\log(w_{t+t^{\prime}}^{l}) \\ &\approx\sum_{t^{\prime}=1}^{T}\sum_{l=1}^{N}\hat{w}_{t+t^{\prime}}^{l}\nabla\log(w_{t+t^{\prime}}^{l}) \\ &\approx\sum_{t^{\prime}=t}^{t+T-1}\sum_{l=1}^{N}\hat{w}_{t^{\prime}+1}^{l}\nabla\log p(\{\mathcal{B}^{l}_{1:K_{t^{\prime}+1},t^{\prime}+1}\}|\{\mathcal{B}^{l}_{1:K_{t^{\prime}},t^{\prime}}\}^{a_{t^{\prime}}^{l}}) +\hat{w}_{t^{\prime}+1}^{l}\nabla\log p(\{\hat{\mathcal{B}}^{l}_{1:M,t^{\prime}+1}\}|\{\mathcal{B}^{l}_{1:K_{t^{\prime}+1},t^{\prime}+1}\}) -\hat{w}_{t^{\prime}+1}^{l}\nabla\log r(\mathcal{B}^{l}_{1:K_{t^{\prime}+1},t^{\prime}+1};\lambda) \\ &\approx\sum_{t^{\prime}=t}^{t+T-1}\sum_{l=1}^{N}\hat{w}_{t^{\prime}+1}^{l}\nabla\mathcal{L}_{track}^{t^{\prime}+1}(a_{t^{\prime}}^{l}\to l)+\hat{w}_{t^{\prime}+1}^{l}\nabla\mathcal{L}_{detect}^{t^{\prime}+1}(l) -\hat{w}_{t^{\prime}+1}^{l}\nabla\log r(\mathcal{B}^{l}_{1:K_{t^{\prime}+1},t^{\prime}+1};\lambda) \end{aligned}$   (30)

where $\hat{w}_{t+t^{\prime}}^{l}=w_{t+t^{\prime}}^{l}/\sum_{l}w_{t+t^{\prime}}^{l}$. The surrogate loss thus decomposes into a weighted sum of tracking and detection losses over frames $\{t+1:t+T\}$, minus the importance-weighted log-density of the sampling distribution at the sampled particles. Built on our particle filter sampling algorithm, the semi-supervised learning algorithm is easily implemented on top of the traditional tracking and detection losses: the only modification is that we include a set of sampled particles as a proxy for object annotations and take the weighted sum of their supervised losses using their importance weights. This training step is iterated throughout the training process. To avoid introducing additional variance, we omit the gradients of $\log r(\mathcal{B}_{1:K_{t^{\prime}+1},t^{\prime}+1};\lambda)$. The detailed training loss is illustrated in Fig. 4.
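
A minimal PyTorch-style sketch of the semi-supervised update of Eq. (30): per-particle tracking and detection losses are combined with normalized importance weights treated as constants, and the proposal log-density term is omitted as described above. The loss functions and particle fields are assumptions:

```python
import torch

def semi_supervised_loss(particles, track_loss_fn, detect_loss_fn):
    """Importance-weighted sum of per-particle losses on an unlabeled frame (Eq. 30).

    particles: list of dicts, one per particle l, with
        'weight'  : unnormalized importance weight w_l (scalar tensor)
        'boxes'   : sampled pseudo-label boxes for frame t'+1
        'ancestor': matched boxes from the ancestor particle at frame t'
        'outputs' : network outputs for frame t'+1
    """
    w = torch.stack([p['weight'] for p in particles])
    w_hat = (w / w.sum()).detach()      # normalized weights, no gradient through them
    loss = 0.0
    for p, wl in zip(particles, w_hat):
        loss = loss + wl * (track_loss_fn(p['ancestor'], p['boxes'])
                            + detect_loss_fn(p['outputs'], p['boxes']))
    return loss
```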

6 Experiments

To show the effectiveness of incorporating an uncertainty treatment into tracking and detection compared with traditional methods, we compare our performance with non-Bayesian baselines. We evaluate on two commonly used datasets.

ImageNet Video Object Detection Dataset (ILSVRC) [16] contains 30 classes in 3862 training and 555 validation videos. The objects have ground truth annotations of their bounding boxes and track IDs.

M2Cai16-Tool-Locations Dataset [9] extends the M2Cai16-tool dataset. It contains 15 videos, recorded at 25 fps, of cholecystectomy procedures at the University Hospital of Strasbourg in France. Among those videos, 2532 frames are labeled, under the supervision and spot-checking of a surgeon, with medical devices including Grasper, Bipolar, Hook, Scissors, Clipper, Irrigator and Specimen Bag.

For a fair comparison of the Bayesian approach with the baselines, all methods use the same training and inference network structure and share the same training configuration. We only introduce an additional loss term for the object transition's covariance matrix (diagonal elements only) in Eq. (17).

A Evaluation Metrics:

Two evaluation metrics are used to quantify prediction uncertainty. We follow the commonly used benchmark of mAP at an IoU threshold of 0.5. We also use a probabilistic measurement, Probability-based Detection Quality (PDQ) [5], to jointly quantify the bounding-box location and categorical uncertainties of the detection estimates. The PDQ score increases as the estimated distribution overlaps with both the label's maximum likelihood and its uncertainty.

B Ablation Studies

Our proposed method is compared against five different baselines to study the effectiveness of incorporating the Bayesian formulation in each individual part of the tracking and detection system. In each baseline, none or only part of the Bayesian formulation is used. We refer to the five baseline methods as: single-frame R-FCN detector (Single R-FCN), greedy R-FCN box linking (Greedy R-FCN), greedy tracking-offset R-FCN box linking (Greedy D&T), frame-wise Bayesian inference (Frame Bayesian) and Kalman-filter trajectory linking of single objects (Kalman-Link).

Single R-FCN takes greedy Non-Maximum Suppression (NMS) outputs directly from the R-FCN, while Frame Bayesian infers object states from all cluster boxes (both suppressed and non-suppressed) using frame-level priors only. Greedy R-FCN links the Single R-FCN predictions frame by frame with bipartite matching on IoU scores. Greedy D&T adds a tracking estimation part to Single R-FCN and links its predictions using tracking-offset IoU scores. Kalman-Link links objects by bipartite matching and updates box locations with a Kalman filter along the trajectories of matched objects.

Table I shows the results of our method in comparison with the above five baselines on the ILSVRC dataset. Our method outperforms all five baselines on both the mAP and PDQ metrics. Our method achieves a frame-level mAP of 72.1 and a video-level mAP of 75.3, a margin of 0.2-0.5 over the second-best method. In terms of PDQ, our method achieves 39.4 at frame level and 40.2 at video level, a margin of 0.2-0.4 over the second-best method. Our method also gains a large margin of around 8.1 PDQ over the Single R-FCN baseline. The Frame Bayesian baseline outperforms Single R-FCN by a margin of 3.7 PDQ through naive inference with a frame-level prior. This gain suggests that greedy NMS is detrimental to the discriminative power of the R-FCN, because it discards a wide spectrum of information that helps distinguish positive from negative bounding boxes and capture the uncertainty of object bounding-box and categorical states. Greedy D&T has nearly the same mAP as Single R-FCN, and is even worse by 1.6; this slight degradation may be due to incorrect matching between the convolutional feature maps and the correlation kernel. Kalman-Link reaches performance second only to our proposed method, thanks to its ability to infer location uncertainty from the states of its previous trajectories. However, its trajectories are established by greedy forward matching of objects across frames in a deterministic way. By considering object-matching uncertainty under a uniform prior on object appearance and associations, our method achieves a 1.6 PDQ gain over Kalman-Link. Moreover, because we consider the uncertainties of object linking and object states jointly, our algorithm allows object-linking uncertainty to be inferred in reverse through particle reweighting.

Table I: Comparison of our method with five different baselines. Each baseline removes one or more modules of our Bayesian framework and replaces them with a naive non-Bayesian one. Our Bayesian method outperforms the baselines in all categories by introducing an uncertainty treatment.

                 Frame          Video
                 PDQ    mAP     PDQ    mAP
Our Method       39.4   72.1    40.2   75.3
Single R-FCN     31.7   70.3    32.1   70.9
Frame Bayesian   35.4   71.2    35.9   72.1
Greedy R-FCN     31.7   70.3    32.4   72.3
Greedy D&T       31.8   68.7    32.4   72.7
Kalman-Link      37.8   71.9    39.4   74.9

C Semi-Supervised Detection Result

We apply our method to semi-supervised learning on the M2Cai16-Tool-Locations dataset. In our implementation, we take 3 additional consecutive frames adjacent to each labeled frame, in random order (forward and backward). We train with the supervised loss in the first stage of training, and add the semi-supervised loss term after the supervised loss converges. Table II shows our semi-supervised detection results in comparison with training on the supervised term only. Our semi-supervised learning algorithm achieves modest improvements over supervised learning from labeled frames alone, with more pronounced improvements on objects with low mAP.

Table II: Comparison of our semi-supervised learning algorithm with learning on labeled frames only.

                 Supervised     Semi-Supervised
                 PDQ    mAP     PDQ    mAP
Grasper          35.4   46.2    37.4   52.3
Bipolar          51.3   65.9    54.2   67.1
Hook             63.9   78.4    64.1   78.6
Scissors         50.2   66.8    54.3   69.1
Clipper          69.8   85.4    70.2   85.5
Irrigator        11.3   16.2    14.9   23.5
Specimen Bag     60.7   75.8    63.1   76.2

7 Conclusion

In this paper, we present our Bayesian model for multi-object detection and tracking in videos. Our method shows the potential of formulating a neural network model in a probabilistic way, especially for tasks that require inference under uncertainty.

References

  • [1] David S Bolme, J Ross Beveridge, Bruce A Draper, and Yui Man Lui. Visual object tracking using adaptive correlation filters. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2544–2550. IEEE, 2010.
  • [2] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Detect to track and track to detect. In Proceedings of the IEEE International Conference on Computer Vision, pages 3038–3046, 2017.
  • [3] G David Forney. The viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, 1973.
  • [4] András Frank. On kuhn’s hungarian method—a tribute from hungary. Naval Research Logistics (NRL), 52(1):2–5, 2005.
  • [5] David Hall, Feras Dayoub, John Skinner, Peter Corke, Gustavo Carneiro, and Niko Sünderhauf. Probability-based detection quality (pdq): A probabilistic approach to detection evaluation. arXiv preprint arXiv:1811.10800, 2018.
  • [6] Wei Han, Pooya Khorrami, Tom Le Paine, Prajit Ramachandran, Mohammad Babaeizadeh, Honghui Shi, Jianan Li, Shuicheng Yan, and Thomas S Huang. Seq-nms for video object detection. arXiv preprint arXiv:1602.08465, 2016.
  • [7] Ali Harakeh, Michael Smart, and Steven L Waslander. Bayesod: A bayesian approach for uncertainty estimation in deep object detectors. arXiv preprint arXiv:1903.03838, 2019.
  • [8] Zhanghexuan Ji, Mohammad Abuzar Shaikh, Dana Moukheiber, Sargur N Srihari, Yifan Peng, and Mingchen Gao. Improving joint learning of chest x-ray and radiology report by word region alignment. In International Workshop on Machine Learning in Medical Imaging, pages 110–119. Springer, 2021.
  • [9] Amy Jin, Serena Yeung, Jeffrey Jopling, Jonathan Krause, Dan Azagury, Arnold Milstein, and Li Fei-Fei. Tool detection and operative skill assessment in surgical videos using region-based convolutional neural networks. IEEE Winter Conference on Applications of Computer Vision, 2018.
  • [10] Kai Kang, Hongsheng Li, Junjie Yan, Xingyu Zeng, Bin Yang, Tong Xiao, Cong Zhang, Zhe Wang, Ruohui Wang, Xiaogang Wang, et al. T-cnn: Tubelets with convolutional neural networks for object detection from videos. IEEE Transactions on Circuits and Systems for Video Technology, 28(10):2896–2907, 2017.
  • [11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [12] Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4282–4291, 2019.
  • [13] Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8971–8980, 2018.
  • [14] Chris J Maddison, John Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Teh. Filtering variational objectives. In Advances in Neural Information Processing Systems, pages 6573–6583, 2017.
  • [15] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [16] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [17] Jack Valmadre, Luca Bertinetto, João Henriques, Andrea Vedaldi, and Philip HS Torr. End-to-end representation learning for correlation filter based tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2805–2813, 2017.
  • [18] Xinshuo Weng and Kris Kitani. A Baseline for 3D Multi-Object Tracking. arXiv:1907.03961, 2019.
  • [19] Tianzhu Zhang, Changsheng Xu, and Ming-Hsuan Yang. Multi-task correlation particle filter for robust object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4335–4343, 2017.
  • [20] Zheng Zhang, Dazhi Cheng, Xizhou Zhu, Stephen Lin, and Jifeng Dai. Integrated object detection and tracking with tracklet-conditioned detection. arXiv preprint arXiv:1811.11167, 2018.
  • [21] Zhipeng Zhang and Houwen Peng. Deeper and wider siamese networks for real-time visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4591–4600, 2019.
  • [22] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. Flow-guided feature aggregation for video object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 408–417, 2017.