
A Bayesian Detect to Track System for Robust Visual Object Tracking and Semi-Supervised Model Learning

Yan Shen, Zhanghexuan Ji, Chunwei Ma, Mingchen Gao
Department of Computer Science and Engineering, University at Buffalo,
The State University of New York, Buffalo, NY, USA
{yshen22,zhanghex,chunweim,mgao8}@buffalo.edu
Abstract

Object tracking is one of the fundamental problems in visual recognition and has seen significant improvements in recent years. These improvements often come at the price of enormous hardware consumption and extensive labeling effort. A missing ingredient for robust tracking is gaining performance with minimal modification to the network structure, and learning from intermittently labeled frames. In our work, we address these problems by modeling the tracking and detection process probabilistically, as multi-object dynamics and per-frame detection uncertainties. Our stochastic model is formulated as a set of neural-network-parameterized distributions. With this formulation, we propose a particle filter-based tracking algorithm for object state estimation. We also present a semi-supervised learning algorithm for intermittently labeled frames based on Variational Sequential Monte Carlo, using the generated particles to estimate a variational bound as the learning objective. In our experiments, we report both mAP and a probability-based detection measurement to compare our algorithm with non-Bayesian baselines; our model outperforms the non-Bayesian baselines on both measurements. We also apply our semi-supervised learning algorithm to the M2Cai16-Tool-Locations dataset, where it outperforms the baseline of learning on labeled frames only.

1 Introduction

Visual object detection and tracking cover a large spectrum of computer vision applications such as video surveillance, motion analysis, action recognition, autonomous driving and medical operation studies. The emergence of deep Convolutional Neural Networks (CNNs) [11] has brought tremendous progress in visual object detection and tracking performance. CNNs are widely utilized for these tasks for two reasons. Firstly, a CNN learns object features that are robust across the full range of variation in the training dataset.

Figure 1: Several challenges exist for a robust tracking and detection system: (a) occlusion, (b) background clutter, (c) motion blur.

Secondly, the shift-covariance property of CNNs allows region proposals to be generated from the areas of maximum response of object-specific detection filters.

A common deep CNN-based video object tracking system consists of two parts: an object detection network and an object displacement network. The R-CNN family [15] is the most commonly used backbone for object detection. Generally, an R-CNN network consists of two stages. In the first stage, a region proposal network (RPN) generates objectness scores and bounding-box offset coordinates from a fixed number of predefined anchors on the output feature maps. In the second stage, candidate regions with high objectness scores are pooled on another feature map to produce refined coordinate offsets and object class scores. For object displacement prediction, a correlation layer combines Siamese-network features from the reference frame and the prediction frame. Several works [10, 6, 22] refine the results with multiple stages of region proposal detection, feature propagation, object tube linking and post-processing.

The main challenges for robust tracking include motion blur, partial occlusion and background clutter, all of which add uncertainty to object state estimation; typical examples are shown in Fig. 1. In response to increasing concerns about model robustness and generalization without introducing extra cost, Bayesian models have seen a resurgence in recent years. Bayesian approaches use a probabilistic treatment of object appearances and states to deal with the uncertainty of model predictions. However, several challenges remain when taking existing works off-the-shelf for learning and inference on network outputs. Firstly, multiple objects appear and disappear across consecutive frames. Secondly, many possible object associations exist when linking objects across frames. Finally, modeling the state uncertainties of a varying number of visible objects from the outputs of a fixed R-CNN structure is still not well solved. In our work, we address the first and second problems by formulating a joint object dynamics over the distributions of a cascaded event of objects appearing/disappearing, new objects arriving and object associations. We address the third problem by treating the R-CNN outputs as clustered emission distributions, over appearance scores, classification scores and location coordinates, conditioned on the objects' ground states.

In this paper, we formulate the problem of multi-object detection and tracking in a fully probabilistic way. Our model is parameterized by tracking and detection neural networks. We take the network structure from the original detect-to-track paper [2] with minimal modification to its notations and definitions. Our formulation consists of a transition model for object dynamics and an emission model for object detection, with the neural network outputs serving as the parameters of the transition and emission distributions. Our probabilistic model handles objects appearing and disappearing by incorporating a prior over object appearance and association. In this model, tracking and detection amount to inferring the posterior of object states from network outputs, and we perform this posterior inference with a particle filter based sampling algorithm: particles are generated from an approximating family of distributions and then re-weighted and re-sampled in accordance with our model. The sampled trajectories take the relations across all visible frames into account. Finally, we present a Variational Sequential Monte Carlo (VSMC) [14] method to learn our model from intermittently labeled frames. Our VSMC method trains on unlabeled consecutive frames by optimizing a tractable evidence lower bound (ELBO), approximated from the sampled labels generated by our particle filter algorithm. Our contributions in this paper are threefold.

  • We give a deep neural network parameterized Bayesian formulation of a detection and tracking system.

  • We propose a particle filter sampling algorithm that makes robust estimates of tracked object states from neural network outputs.

  • We present a semi-supervised training algorithm for our model based on VSMC. Our VSMC algorithm trains on both labeled frames and unlabeled consecutive frames, using pseudo-labels generated by our sampling algorithm.

2 Related Work

A comprehensive review of tracking and detection algorithms is beyond the scope of this paper. We review the works that are most related to our study.

Object Detection and Tracking Currently, state-of-the-art object detection and tracking systems consist of multiple stages of region proposal detection, feature propagation, object tube linking and post-processing. The first stage extracts region proposal candidates. Representative works include 2D R-CNN networks for frame-level box proposals [6], 3D R-CNN networks for video-level tube proposals [10] and feature propagation [22]. In the second stage, detected objects are linked together to make tracking predictions. Either direct detection boxes [20] or tracking-displacement offset boxes [2] are linked by a bipartite matching algorithm [4] or the Viterbi algorithm [3]. In the final stage, suppression methods are used to remove duplicates and false positives.

Correlation Filter and Siamese Network Correlation filters have recently been introduced into visual tracking and shown to achieve high speed as well as robust performance [1, 17]. Siamese networks with triplet loss have also shown advantages in clustering similar object samples and are commonly used in representation learning [8]. Li et al. [13] propose a Siamese-RPN network trained end-to-end to regress tracking template locations. Li et al. [12] improve the performance by using a ResNet-50 backbone and aggregating RPN predictions over multiple layers. Zhang and Peng [21] propose a deeper and wider Siamese network, adding multiple crops to remove padding effects. Instead of cropping the objects on template images, Feichtenhofer et al. [2] use RoI pooling on correlation feature maps to predict object displacement from template to target.

Bayesian Model for Tracking and Detection Bayesian methods have been applied to object detection and tracking for their robustness in handling object state uncertainty. Zhang et al. [19] combine particle filters and correlation filters for robust object tracking in videos. They solve a correlation filter in the dual space by an accelerated proximal gradient method, then use a particle filter tracker to generate particles from the transition model, apply the correlation filter to shift each particle to a stable location, and reweight the samples using the filter responses. However, their correlation filter is based on shallow features in the Fourier domain. Harakeh et al. [7] combine Bayesian inference and CNNs to deal with the uncertainty of deep neural network observations, assuming a Gaussian prior on object locations and a Dirichlet prior on object categories. Weng and Kitani [18] propose a baseline for 3D multi-object tracking: they use a Gaussian distribution to model object location uncertainty and a Kalman filter to infer the trajectory of each object.

3 A Bayesian Formulation of Object Detection and Tracking System

In this section, we give a Bayesian formulation of object detection and tracking. We consider the joint process of object detection and tracking as a hidden Markov model (HMM), where the object states across frames are the hidden states. We view tracking as the transition between neighboring hidden states and detection as the emission from hidden states to noisy visible states observed through the R-FCN network outputs. We show that, in the supervised case, the traditional tracking and detection loss takes a form similar to the maximum-likelihood objective of our model.

A Definition and Notations

We denote the state of the object indexed by $i$ at frame $t$ as $\mathcal{B}_{i,t}$, defined as the tuple

$\mathcal{B}_{i,t}\triangleq\{L_{i,t},\mathcal{C}_{i}\}$   (1)

where $L_{i,t}$ denotes the vector of bounding-box locations and $\mathcal{C}_{i}$ denotes the category of the object.

The category of object $\mathcal{B}_{i,t}$ remains unchanged across all frames in which it appears, while the location $L_{i,t}$ shifts between frames. We model the transition distribution of object $i$ between frames $t$ and $t+1$ as a neural-network-parameterized Gaussian distribution

$p_{g}(L_{i,t+1}|L_{i,t})\sim\mathcal{N}(g^{\phi}_{\mu}(\mathcal{B}_{i,t}),g^{\phi}_{\sigma}(\mathcal{B}_{i,t}))$   (2)

where $g^{\phi}_{\mu}(\mathcal{B}_{i,t})$ and $g^{\phi}_{\sigma}(\mathcal{B}_{i,t})$ are given by the RoI outputs of the deep correlational kernel network of Feichtenhofer et al. [2].
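
For illustration, a minimal sketch of how Eq. (2) can be used to propagate a box between frames, assuming the tracking network returns a per-coordinate mean and variance (variable names are hypothetical):

```python
import numpy as np

def sample_transition(g_mu, g_var, rng=np.random):
    """Draw L_{i,t+1} from the Gaussian transition of Eq. (2).

    g_mu:  (4,) predicted box mean g_mu^phi(B_i,t)
    g_var: (4,) predicted per-coordinate variance g_sigma^phi(B_i,t)
    """
    return rng.normal(loc=g_mu, scale=np.sqrt(g_var))
```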

At each frame $t$, the R-FCN gives $M$ anchored observations $[\hat{\mathcal{B}}_{1,t},\dots,\hat{\mathcal{B}}_{M,t}]$. Each anchored observation is either anchored to a true object $\mathcal{B}_{i,t}$ as defined above or is a clutter observation (false positive). We make the simplifying assumption that each box output corresponds to a unique object in the ground-truth state; this is the assumption most widely adopted in traditional R-FCN training and inference. Under this assumption, we introduce an anchor-to-object variable $\mathbf{u}$ given by

$u_{i}=k\in\{1,2,\dots,K_{t}\}$   (3)

meaning that the R-FCN's anchored observation $\hat{\mathcal{B}}_{i,t}$ is clustered around object $\mathcal{B}_{k,t}$. In our model, the emission distribution from cluster $\mathcal{B}_{k,t}$ to the anchored observations also follows a network-parameterized Gaussian distribution

$p_{f}(\hat{L}_{i,t}|L_{k,t})\sim\mathcal{N}(L_{k,t},f_{\theta}(\mathcal{B}_{k,t}))$   (4)

To distinguish between ground-truth and clutter observations, we use another binary variable $\mathcal{E}_{i,t}$ given by

$\mathcal{E}_{i,t}=\begin{cases}0&\quad\mathcal{B}_{k,t}\text{ is associated with one real object}\\ 1&\quad\mathcal{B}_{k,t}\text{ is a clutter observation}\end{cases}$   (5)

B Tracking as Object Dynamics and Association

We assume that $K_{t}$ objects appear at frame $t$. At the next frame, each object remains or disappears independently with death probability $\lambda_{D}$, leaving $\hat{K}_{t}$ remaining objects, while $\Delta K_{t}$ new targets arrive at rate $\lambda_{L}$ following a Poisson distribution. The whole object appearance dynamics can be written as

$p(\hat{K}_{t},\Delta K_{t})={\lambda_{D}}^{K_{t}-\hat{K}_{t}}{(1-\lambda_{D})}^{\hat{K}_{t}}\dfrac{\lambda_{L}^{\Delta K_{t}}e^{-\lambda_{L}}}{\Delta K_{t}!}$   (6)

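As a minimal sketch, the appearance prior of Eq. (6) can be evaluated in log space as follows; the death probability and birth rate values are hypothetical hyperparameters:

```python
import math

def log_appearance_prior(k_prev, k_survived, k_new, lambda_d=0.1, lambda_l=0.5):
    """Log of Eq. (6): Bernoulli survival of k_prev objects plus Poisson arrivals.

    k_prev:     number of objects at frame t (K_t)
    k_survived: number of those that remain at the next frame (K_hat_t)
    k_new:      number of newly arrived objects (Delta K_t)
    """
    log_survival = ((k_prev - k_survived) * math.log(lambda_d)
                    + k_survived * math.log(1.0 - lambda_d))
    log_birth = (k_new * math.log(lambda_l) - lambda_l
                 - math.lgamma(k_new + 1))  # log(lambda^k e^{-lambda} / k!)
    return log_survival + log_birth
```
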
Considering that all objects are observed in an unknown order, we formulate the distribution over a reordering of the $\hat{K}_{t+1}+\Delta K_{t}$ objects by introducing a measurement-to-target association (M$\to$T) hypothesis $\lambda=(\mathbf{r},\hat{K}_{t+1},\hat{K}_{t+1}+\Delta K_{t})$. The elements of the association vector $\mathbf{r}=(r_{1},r_{2},r_{3},\dots,r_{\hat{K}_{t+1}})$ are given by

$r_{j}=k\in\{1,2,3,\dots,\hat{K}_{t+1}+\Delta K_{t}\}$   (7)

meaning that the $j$-th object at frame $t$ is associated with the $k$-th object at frame $t+1$. We also assign a uniform prior to the association vector $\mathbf{r}$

$p(\mathbf{r})=\dbinom{\hat{K}_{t}+\Delta K_{t}}{\hat{K}_{t}}^{-1}$   (8)

A newly appeared object $\mathcal{B}_{j,t+1}$ follows a Gaussian prior on its location $L_{j,t+1}$ and a uniform prior on its category $\mathcal{C}_{j}$

$p_{0}(L_{j,t+1})\sim\mathcal{N}(\mu_{0},\Sigma_{0}), \qquad p_{0}(\mathcal{C}_{j})\sim F(k,0,K-1)$   (9)

For an existing object $\mathcal{B}_{j,t+1}$, its location $L_{j,t+1}$ is updated by the transition distribution defined in Eq. (2), parameterized by the tracking network, and its category $\mathcal{C}_{j}$ remains the same as in its previous state.

With the above definitions, the tracking process is modeled as the joint transition probability of object dynamics and associations

$\begin{aligned} &p(\{\mathcal{B}_{1,t+1},\dots,\mathcal{B}_{\hat{K}_{t+1}+\Delta K_{t},t+1}\},\mathbf{r}\,|\,\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{K_{t},t}\}) \\ &= p(\mathbf{r})\,p(\hat{K}_{t},\Delta K_{t}) \prod_{i=1}^{\hat{K}_{t}}p_{g}(L_{r_{i},t+1}|L_{i,t})\,I(\mathcal{C}_{r_{i},t+1}|\mathcal{C}_{i,t}) \prod_{i\notin\mathbf{r}}p_{0}(\mathcal{B}_{i})\,p_{0}(\mathcal{C}_{i}) \end{aligned}$

where $I(\cdot)$ is an indicator function with $I(\mathcal{C}_{r_{i},t+1}|\mathcal{C}_{i,t})=1$ if $\mathcal{C}_{r_{i},t+1}=\mathcal{C}_{i,t}$ and $0$ otherwise. The whole object dynamics is illustrated in Fig. 2.

Figure 2: The whole object dynamics in our model. Multiple objects appear, disappear and are observed in an unknown order.

C Detection as Object Emission

We consider the emission distribution from a real object $\mathcal{B}_{i,t}$ to its anchored observations $\hat{\mathcal{B}}_{k,t}$ in the R-FCN outputs. Recall that the R-FCN's emissions from the final layer of feature maps consist of three parts: the object appearance score $e_{k,t}$, the object classification score $\mathcal{K}_{k,t}$ and the object location coordinates $L_{k,t}$. Instead of viewing the probability output directly as a distribution over the object's existence and category, we treat the R-FCN's categorical score output as a draw from a distribution conditioned on the real category $\mathcal{C}_{i}$ of the associated object. In our formulation, $\mathcal{K}_{k,t}$ follows a Dirichlet distribution with concentration $\alpha$ conditioned on $\mathcal{C}_{i}$

$p(\hat{\mathcal{K}}_{i,t}|\mathcal{C}_{i})\sim\textit{Dir}(\alpha_{1},\dots,\alpha_{K})$   (10)

where, in our formulation, $\alpha_{1},\dots,\alpha_{K}$ are set as

$\alpha_{k}=\begin{cases}\alpha+1&\quad\text{if }k=\mathcal{C}_{i}\\ 1&\quad\text{if }k\neq\mathcal{C}_{i}\end{cases}$   (11)

Similarly, we treat the object appearance score $e_{k,t}$ as following a Beta distribution conditioned on $\mathcal{E}_{i,t}$, which indicates whether $\hat{\mathcal{B}}_{k,t}$ is anchored around a real object or a clutter observation:

$p(\hat{e}_{j,t}|\mathcal{E}_{i,t})\sim\textit{Beta}(\alpha_{0},\alpha_{1})$   (12)

where $\alpha_{0},\alpha_{1}$ are set as

$\alpha_{0},\alpha_{1}=\begin{cases}\alpha+1,\,1&\quad\text{if }\mathcal{E}_{i,t}=0\\ 1,\,\alpha+1&\quad\text{if }\mathcal{E}_{i,t}=1\end{cases}$   (13)

so that when $\hat{\mathcal{B}}_{k,t}$ is anchored around a real object $\mathcal{B}_{i,t}$,

$p(\hat{e}_{j,t}|\mathcal{E}_{i,t})\sim\textit{Beta}(\alpha+1,1)$   (14)

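As a minimal sketch, the emission densities of Eqs. (10)-(14) can be evaluated with standard distributions; the concentration value `alpha` and the example scores below are hypothetical:

```python
import numpy as np
from scipy.stats import dirichlet, beta

def class_emission_logpdf(class_scores, true_class, alpha=4.0):
    """Dirichlet log-density of a K-dimensional classification score vector,
    with concentration alpha+1 on the true class and 1 elsewhere (Eqs. 10-11)."""
    conc = np.ones_like(class_scores)
    conc[true_class] += alpha
    return dirichlet.logpdf(class_scores, conc)

def appearance_emission_logpdf(e_score, is_real, alpha=4.0):
    """Beta log-density of the appearance score: Beta(alpha+1, 1) for a real
    object and Beta(1, alpha+1) for a clutter observation (Eqs. 12-13)."""
    a, b = (alpha + 1.0, 1.0) if is_real else (1.0, alpha + 1.0)
    return beta.logpdf(e_score, a, b)

# Example: a 5-class score vector concentrated on class 2, and a high appearance score.
scores = np.array([0.05, 0.05, 0.8, 0.05, 0.05])
print(class_emission_logpdf(scores, true_class=2))
print(appearance_emission_logpdf(0.9, is_real=True))
```
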
For the object location coordinate outputs $\hat{L}_{k,t}$, we consider $\hat{L}_{k,t}$ a noisy observation of the object's real location $L_{i,t}$: it follows the Gaussian distribution of Eq. (4), with mean $L_{i,t}$ and variance $f_{\theta}(\mathcal{B}_{k,t})$. In our work, we do not treat $f_{\theta}(\mathcal{B}_{k,t})$ as a direct output of the R-FCN network. Instead, we follow [7] in viewing the covariance as the combined contribution of model uncertainty and prediction uncertainty. As in their work, we omit the model uncertainty and use the prediction uncertainty $\Sigma(\mathcal{B}_{i,t})$ as an approximation of $f_{\theta}(\mathcal{B}_{k,t})$. The prediction uncertainty $\Sigma(\mathcal{B}_{i,t})$ is taken as the covariance of the coordinate predictions $\hat{L}_{k,t}$ over all anchors clustered around $\mathcal{B}_{k,t}$.

$\mu(\mathcal{B}_{i,t})=\dfrac{1}{M_{i}}\sum_{u_{j}=i}f^{\theta}(\hat{\mathcal{B}}_{j,t})$   (15)
$\Sigma(\mathcal{B}_{i,t})\approx\dfrac{\alpha}{M_{i}}\Big(\sum_{u_{j}=i}f^{\theta}(\hat{\mathcal{B}}_{j,t})f^{\theta}(\hat{\mathcal{B}}_{j,t})^{T}\Big)-\alpha\,\mu(\mathcal{B}_{i,t})\mu(\mathcal{B}_{i,t})^{T}$   (16)

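A minimal sketch of the cluster statistics in Eqs. (15)-(16), computed from the box predictions of all anchors assigned to one object; the array shapes and the scaling factor `alpha` are assumptions:

```python
import numpy as np

def cluster_statistics(anchor_boxes, alpha=1.0):
    """Mean and scaled covariance of the box predictions of all anchors
    assigned to one object cluster (Eqs. 15-16).

    anchor_boxes: array of shape (M_i, 4) with the predicted box coordinates
                  f_theta(B_hat_{j,t}) of the M_i anchors with u_j = i.
    """
    m_i = anchor_boxes.shape[0]
    mu = anchor_boxes.mean(axis=0)                        # Eq. (15)
    second_moment = anchor_boxes.T @ anchor_boxes / m_i   # (1/M_i) sum f f^T
    sigma = alpha * (second_moment - np.outer(mu, mu))    # Eq. (16)
    return mu, sigma
```
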
For an anchored observation $\hat{\mathcal{B}}_{k,t}$ associated with a clutter detection $\hat{\mathcal{B}}^{0}_{i,t}$, we assume a Gaussian prior on the clutter location and a uniform prior on the clutter category, in the same form as Eq. (9).

With the above definitions, the joint emission probability is given by

$\begin{aligned} &p(\{\hat{\mathcal{B}}_{1,t},\dots,\hat{\mathcal{B}}_{M,t}\}\,|\,\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{K_{t},t}\}) \\ &=\Gamma\prod_{i\in(\hat{E}_{i,t}=1)}\hat{e}_{j,t}^{\alpha}\,\mathcal{N}(f^{\theta}(\hat{L}_{i,t});L_{u_{i},t},f_{\theta}(\mathcal{B}_{k,t}))\,c_{i}^{\alpha} \\ &\quad\;\sum_{L_{u_{i},t}}\sum_{C_{u_{i}}}\prod_{i\in(\hat{E}_{i,t}=0)}(1-\hat{e}_{j,t})^{\alpha}\,\mathcal{N}(f^{\theta}(\hat{L}_{i,t});L_{u_{i},t},f_{\theta}(\mathcal{B}_{k,t}))\,c_{i}^{\alpha}\,p_{0}(L_{j,t})\,p_{0}(C_{u_{i}}) \end{aligned}$

where $c_{i}=\hat{\mathcal{K}}_{i,t}(\mathcal{C}_{u_{i}})$ is the categorical score for the associated object's class. Because we do not infer the location and category of negative anchors, we take the marginalized distribution over the categorical and location output scores for negative anchors. The whole R-FCN detection process is shown in Fig. 3.

Figure 3: Our Bayesian view of the traditional tracking and detection network outputs.

D Traditional D&T Loss as Model Likelihood

In the case of supervised learning, we have annotations of all hidden states, all visible states and distributions are parameterized by the network outputs, and all association variables are also annotated for tracking. We train the network by taking the log-likelihood of the joint probability as

$\begin{aligned} \mathcal{L} &=\sum_{t=1}^{T}\log p(\{\hat{\mathcal{B}}_{1,t},\dots,\hat{\mathcal{B}}_{M,t}\}|\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{K_{t},t}\}) \\ &\quad+\sum_{t=1}^{T-1}\log p(\{\mathcal{B}_{1,t+1},\dots,\mathcal{B}_{\hat{K}_{t+1}+\Delta K_{t},t+1}\},\mathbf{r}_{t}|\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{K_{t},t}\}) \\ &=\sum_{i\in\{r_{i}\in\{1:K_{t+1}\}\},t}\tfrac{1}{2}\|L_{r_{i},t+1}-g^{\phi}_{\mu}(\mathcal{B}_{i,t})\|_{g^{\phi}_{\sigma}(\mathcal{B}_{i,t})^{-1}}+\tfrac{1}{2}\log\big(g^{\phi}_{\sigma}(\mathcal{B}_{i,t})\big) \\ &\quad+\sum_{i=1,t}^{i\leq M}\tfrac{1}{2}\|\hat{L}_{i,t}-L_{u_{i},t}\|_{\Sigma^{-1}}+\tfrac{1}{2}\log\det\big(\Sigma^{-1}(\mathcal{B}_{i,t})\big)+H(e_{u_{i}},\mathcal{E}_{u_{i}})+H(\mathcal{K}_{i,t},\mathcal{C}_{u_{i}})+\text{const} \\ &=\mathcal{L}_{track}+\mathcal{L}_{detect}+\text{const} \end{aligned}$   (17)

where $\mathcal{L}_{track}$ and $\mathcal{L}_{detect}$ are the commonly used training losses for the deep correlational kernel tracking network (or Siamese network) and the R-FCN detection network. The only difference is that we introduce $g^{\phi}_{\sigma}(\mathcal{B}_{i,t})$ as a parameterized network output giving the independent variance of the tracking coordinate prediction, and include the term $\frac{1}{2}\|L_{r_{i},t+1}-g^{\phi}_{\mu}(\mathcal{B}_{i,t})\|_{g^{\phi}_{\sigma}(\mathcal{B}_{i,t})^{-1}}+\frac{1}{2}\log(g^{\phi}_{\sigma}(\mathcal{B}_{i,t}))$ for jointly training the tracking mean and variance. We use the traditional detection loss for training the detection network and omit the term $\frac{1}{2}\log\det(\Sigma^{-1}(\mathcal{B}_{i,t}))$ during training. We also omit the marginal distributions of location and category for clutter anchors.
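
As an illustration, a minimal PyTorch-style sketch of the tracking term of Eq. (17), which jointly trains the predicted offset mean and per-coordinate variance; the tensor names and shapes are assumptions:

```python
import torch

def tracking_nll_loss(pred_mu, pred_log_sigma, target_box):
    """Gaussian negative log-likelihood term of Eq. (17):
    0.5 * ||L - g_mu||^2 / g_sigma + 0.5 * log g_sigma, summed over coordinates.

    pred_mu:        (N, 4) predicted box offsets g_mu^phi(B_i,t)
    pred_log_sigma: (N, 4) predicted log-variances log g_sigma^phi(B_i,t)
    target_box:     (N, 4) ground-truth boxes L_{r_i, t+1}
    """
    inv_sigma = torch.exp(-pred_log_sigma)
    nll = 0.5 * (target_box - pred_mu).pow(2) * inv_sigma + 0.5 * pred_log_sigma
    return nll.sum(dim=-1).mean()
```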

4 A Particle Filter Object State Estimation Algorithm For Robust Tracking

With the formulation in the previous section, robust tracking amounts to inferring the posterior over all object states and their associations across frames, given the R-FCN observations and the dynamics from the tracking network. While our probabilistic formulation makes it straightforward to evaluate the joint probability, sampling from the posterior is less straightforward: directly inferring the joint states of objects and their associations is intractable, and an analytical solution would require traversing all possible object appearance and association configurations, which is impractical to implement. We therefore give a particle filter based solution that samples from an approximating family of tractable proposal distributions.

Initial Sample on R-FCN Detectors    Assume that the R-FCN outputs $M$ anchored bounding-box predictions $\{\hat{\mathcal{B}}_{1,t},\dots,\hat{\mathcal{B}}_{M,t}\}$ at frame $t$. We sample $\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{K_{t},t}\}$ according to the posterior

$r_{1}(\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{K_{t},t}\})\propto p(\{\hat{\mathcal{B}}_{1,t},\dots,\hat{\mathcal{B}}_{M,t}\}|\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{K_{t},t}\})$

We sample from $r_{1}(\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{K_{t},t}\})$ as follows (the proof is given later in this section). First, the per-anchor outputs of the neural network are clustered, similarly to NMS. Greedy clustering is performed on the output category scores: the anchor with the highest non-background score is chosen as the cluster center, any anchor with an intersection over union (IoU) greater than 0.5 is added to the cluster, and the cluster members are removed from the remaining anchor set. This process is repeated until all anchors are assigned to clusters. Treating each cluster as an object candidate yields the candidate objects $\{\tilde{\mathcal{B}}_{1,t},\dots,\tilde{\mathcal{B}}_{K_{t},t}\}$. We then sample the objects and their states with the following steps (a code sketch of the whole procedure is given after the list).

  1. Step 1: Include object $\tilde{\mathcal{B}}_{i,t}$ with probability

     $p_{i}=\prod_{j\in(u_{j}=i)}(\hat{e}_{j})^{\alpha}\Big/\Big(\prod_{j\in(u_{j}=i)}(\hat{e}_{j})^{\alpha}+\prod_{j\in(u_{j}=i)}(1-\hat{e}_{j})^{\alpha}\Big)$

  2. Step 2: Assign the initial box state $\mu(\mathcal{B}_{i,t})$ by Eq. (15).

  3. Step 3: Assign a new tracking id to each $\mathcal{B}_{i,t}$.

  4. Step 4: Sample the object class $\mathcal{C}_{i}\sim\text{Categorical}(c_{i})$, where

     $c_{i}=\prod_{j\in(u_{j}=i)}\mathcal{K}_{j,k}^{\alpha}\Big/\sum_{k^{\prime}}\prod_{j\in(u_{j}=i)}\mathcal{K}_{j,k^{\prime}}^{\alpha}$

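The following sketch illustrates the greedy clustering and the per-cluster sampling of Steps 1-4; the box format, score arrays and IoU helper are assumptions rather than the authors' implementation:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def greedy_cluster(boxes, obj_scores, iou_thr=0.5):
    """Cluster anchors around local maxima of the non-background score (NMS-like)."""
    remaining = list(np.argsort(-obj_scores))
    clusters = []
    while remaining:
        center = remaining.pop(0)
        members = [center] + [j for j in remaining if iou(boxes[center], boxes[j]) > iou_thr]
        remaining = [j for j in remaining if j not in members]
        clusters.append(members)
    return clusters

def sample_initial_objects(boxes, obj_scores, class_scores, alpha=1.0, rng=np.random):
    """Steps 1-4: include each cluster with probability p_i, set its box to the
    cluster mean (Eq. 15), and sample its class from the product of anchor scores."""
    objects = []
    for members in greedy_cluster(boxes, obj_scores):
        e = obj_scores[members]
        log_pos = alpha * np.sum(np.log(e + 1e-9))
        log_neg = alpha * np.sum(np.log(1.0 - e + 1e-9))
        p_i = 1.0 / (1.0 + np.exp(log_neg - log_pos))        # Step 1
        if rng.random() < p_i:
            mu = boxes[members].mean(axis=0)                  # Step 2
            log_c = alpha * np.sum(np.log(class_scores[members] + 1e-9), axis=0)
            c = np.exp(log_c - log_c.max()); c /= c.sum()     # Step 4
            cls = rng.choice(len(c), p=c)
            objects.append({"box": mu, "class": int(cls)})    # Step 3: new id = list index
    return objects
```
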
Data Association    After including the sampled objects and their states, we associate the generated initial objects $\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{\tilde{K}_{t},t}\}^{l}$ with the objects in their ancestor particle $\{\mathcal{B}_{1,t-1},\dots,\mathcal{B}_{K_{t-1},t-1}\}^{a_{t-1}^{l}}$. To match these two sets of objects, we first construct an affinity matrix $J$ of dimension $\tilde{K}_{t}\times K_{t-1}$, assigning to its elements the sum of the log IoU between objects and their log class scores

$J_{i,j}=\log IoU(g_{\phi}^{\mu}(\mathcal{B}_{i,t-1}^{a_{t-1}^{l}}),\mu(\mathcal{B}_{j,t}))+\log\mathcal{K}_{j,\mathcal{C}_{i}}$   (18)

We then apply the Hungarian algorithm to solve the bipartite matching problem in polynomial time. In addition, we reject a match between $\mathcal{B}_{i,t-1}$ and $\mathcal{B}_{j,t}$ when their IoU is below a given threshold $IoU_{min}$. Without loss of generality, we assume the output is a set of matched objects $\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{\hat{K}_{t},t}\}^{l}$ in frame $t$ with their ancestors $\{\mathcal{B}_{r_{1},t-1},\dots,\mathcal{B}_{r_{\hat{K}_{t}},t-1}\}^{a_{t-1}^{l}}$ in frame $t-1$, along with the unmatched objects $\{\mathcal{B}_{\hat{K}_{t}+1,t},\dots,\mathcal{B}_{\tilde{K}_{t},t}\}^{l}$.
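
A minimal sketch of this matching step using `scipy.optimize.linear_sum_assignment`; the affinity follows Eq. (18), while the `iou_fn` helper, box format and threshold value are assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(prev_boxes, prev_classes, cur_boxes, cur_class_scores,
              iou_fn, iou_min=0.3):
    """Match current objects to ancestor objects via the affinity of Eq. (18).

    prev_boxes:       (K_{t-1}, 4) offset-predicted boxes from the ancestor particle
    prev_classes:     (K_{t-1},)   class index of each ancestor object
    cur_boxes:        (K_t, 4)     sampled initial boxes mu(B_{j,t}) at frame t
    cur_class_scores: (K_t, C)     class scores K_j of each current object
    Returns a list of (ancestor index, current index) pairs.
    """
    k_prev, k_cur = len(prev_boxes), len(cur_boxes)
    iou_mat = np.array([[iou_fn(prev_boxes[i], cur_boxes[j])
                         for j in range(k_cur)] for i in range(k_prev)])
    affinity = (np.log(iou_mat + 1e-9)
                + np.log(cur_class_scores[:, prev_classes].T + 1e-9))  # Eq. (18)
    rows, cols = linear_sum_assignment(-affinity)   # Hungarian algorithm, maximize affinity
    return [(i, j) for i, j in zip(rows, cols) if iou_mat[i, j] >= iou_min]
```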

Independent Proposal Update for Tracking Objects    After associating the sampled objects with their ancestors, we update the state of each sampled object independently by sampling from its posterior conditioned on its ancestor. With the transition probability of matched objects as the prior, we sample each object's bounding-box state from the corresponding posterior. We use this proposal distribution to minimise the variance of the importance weights

$\begin{aligned} r_{2}(\mathcal{B}_{i,t}) &\propto p(\mathcal{B}_{i,t}|\mathcal{B}_{r_{i}^{-1},t-1}^{a_{t-1}^{l}})\prod_{j\in u_{j}=i}p(\hat{\mathcal{B}}_{j,t}|\mathcal{B}_{i,t}) \\ &=\mathcal{N}(\mu^{\prime}(\mathcal{B}_{i,t}),\Sigma^{\prime}(\mathcal{B}_{i,t})) \end{aligned}$   (19)

where the sufficient statistics of the proposal in Eq. (19) can be computed in closed form as

$\Sigma^{\prime}(\mathcal{B}_{i,t})=\big(\mathrm{diag}^{-1}(g_{\phi}^{\sigma}(\mathcal{B}^{a_{t-1}^{l}}_{r_{i}^{-1},t-1}))+M_{i}\Sigma^{-1}(\mathcal{B}_{i,t})\big)^{-1}$   (20)

$\mu^{\prime}(\mathcal{B}_{i,t})=\Sigma^{\prime}\big((g_{\phi}^{\sigma}(\mathcal{B}^{a_{t-1}^{l}}_{r_{i}^{-1},t-1}))^{-1}\otimes g_{\phi}^{\mu}(\mathcal{B}^{a_{t-1}^{l}}_{r_{i}^{-1},t-1})+M_{i}\Sigma^{-1}(\mathcal{B}_{i,t})\mu(\mathcal{B}_{i,t})\big)$   (21)

where $M_{i}$ is the number of anchored observations of object $\mathcal{B}_{i,t}$ in the R-FCN.
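
A minimal sketch of the closed-form fusion in Eqs. (20)-(21), combining the tracking-network transition prior (diagonal variance) with the cluster observation statistics; variable names are assumptions:

```python
import numpy as np

def proposal_update(track_mu, track_var, obs_mu, obs_cov, m_i):
    """Posterior mean and covariance of a matched object's box (Eqs. 20-21).

    track_mu:  (4,)   transition mean g_mu from the tracking network
    track_var: (4,)   per-coordinate transition variance g_sigma (diagonal)
    obs_mu:    (4,)   cluster mean mu(B_i,t) from Eq. (15)
    obs_cov:   (4, 4) cluster covariance Sigma(B_i,t) from Eq. (16)
    m_i:       number of anchors assigned to the object
    """
    prior_prec = np.diag(1.0 / track_var)              # diag^{-1}(g_sigma)
    obs_prec = m_i * np.linalg.inv(obs_cov)            # M_i * Sigma^{-1}(B_i,t)
    post_cov = np.linalg.inv(prior_prec + obs_prec)    # Eq. (20)
    post_mu = post_cov @ (prior_prec @ track_mu + obs_prec @ obs_mu)  # Eq. (21)
    return post_mu, post_cov
```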

Similarly, for an unmatched object $i^{\prime}$, $\mu^{\prime}_{0}(\mathcal{B}_{i^{\prime},t})$ and $\Sigma^{\prime}_{0}(\mathcal{B}_{i^{\prime},t})$ are given by

$\Sigma_{0}^{\prime}(\mathcal{B}_{i^{\prime},t})=\big(\Sigma^{-1}_{0}+M_{i^{\prime}}\Sigma^{-1}(\mathcal{B}_{i^{\prime},t})\big)^{-1}$   (22)

$\mu^{\prime}_{0}(\mathcal{B}_{i^{\prime},t})=\Sigma_{0}^{\prime}\big(\Sigma^{-1}_{0}\mu_{0}+M_{i^{\prime}}\Sigma^{-1}(\mathcal{B}_{i^{\prime},t})\mu(\mathcal{B}_{i^{\prime},t})\big)$   (23)

Finally, we keep the sampled class $\mathcal{C}_{i^{\prime}}$ for each unmatched object $i^{\prime}$. For a matched object $i$, we discard the sampled class and assign $\mathcal{C}_{i}$ the class of its matched ancestor object, $\mathcal{C}_{r^{-1}_{i}}^{a_{t-1}^{l}}$. Similarly, we keep the tracking id of unmatched objects and assign each matched object $\mathcal{B}_{i,t}$ the id $id(\mathcal{B}_{r_{i}^{-1},t-1}^{a_{t-1}^{l}})$ of its ancestor.

Theorem 1

The importance weight $w_{t}^{l}$ of the sampled object state $\{\mathcal{B}_{1,t},\dots,\mathcal{B}_{\tilde{K}_{t},t}\}^{l}$ given its ancestor state $\{\mathcal{B}_{1,t-1},\dots,\mathcal{B}_{K_{t-1},t-1}\}^{a_{t-1}^{l}}$ is

$w_{t}^{l}\propto \underbrace{(1-\lambda_{D})^{K_{t-1}-\hat{K}_{t}}\lambda_{D}^{\hat{K}_{t}}}_{\text{Object Appearance}}\;\underbrace{p((\tilde{K}_{t}-\hat{K}_{t});\lambda_{L})}_{\text{New Object Arriving}}\;\underbrace{\dbinom{\tilde{K}_{t}}{\hat{K}_{t}}^{-1}}_{\text{Association Prior}}\; e^{-\frac{1}{2}R_{D}}\,\tau_{D}\,\underbrace{\prod_{i\in\mathbf{r}}\mathcal{K}_{i,\mathcal{C}_{r^{-1}_{i}}^{a_{t-1}^{l}}}}_{\text{Tracking Class Probability}}$   (24)

where $R_{D}$ is given by

$\begin{aligned} R_{D} &=\sum_{i\in\mathbf{r}}\underbrace{g_{\phi}^{\mu}(\mathcal{B}^{a_{t-1}^{l}}_{r_{i}^{-1},t-1})^{\circ 2}\odot(g_{\phi}^{\sigma}(\mathcal{B}^{a_{t-1}^{l}}_{r_{i}^{-1},t-1}))^{-1}}_{\text{Tracking Transition Prior}} -\underbrace{\mu_{0}^{T}\Sigma^{-1}_{0}\mu_{0}}_{\text{New Object Prior}} -\underbrace{\mu^{\prime}(\mathcal{B}_{i,t})^{T}\Sigma^{\prime}(\mathcal{B}_{i,t})^{-1}\mu^{\prime}(\mathcal{B}_{i,t})}_{\text{Matched Location Posterior}} \\ &\quad+\underbrace{\mu^{\prime}_{0}(\mathcal{B}_{i,t})^{T}\Sigma^{\prime}_{0}(\mathcal{B}_{i,t})^{-1}\mu^{\prime}_{0}(\mathcal{B}_{i,t})}_{\text{Unmatched Location Posterior}} \end{aligned}$   (25)-(26)

and $\tau_{D}$ is given by

$\tau_{D}=\prod_{i\in\mathbf{r}}|g_{\phi}^{\sigma}(\mathcal{B}^{a_{t-1}^{l}}_{r_{i}^{-1},t-1})|^{-1/2}_{\otimes}\det(\Sigma_{0})^{1/2}\det(\Sigma^{\prime}(\mathcal{B}_{i,t})^{-1})^{1/2}\det(\Sigma^{\prime}_{0}(\mathcal{B}_{i,t})^{-1})^{-1/2}$   (27)

5 VSMC Algorithm For Semi-Supervised Model Learning

In this section, we describe our model training algorithm in the semi-supervised setting. In this setting, we only have labeled bounding boxes at frame $t$; in the neighboring frames $\{t-T:t-1\}$ and $\{t+1:t+T\}$, only the visual input is given, without labeled bounding boxes. Unlike the supervised learning algorithm of Section 3.D, training on the joint distribution of the unlabeled frames is not feasible. Instead, our objective is the log-likelihood of the marginal distribution in which the unseen annotations are marginalized out.

$\begin{aligned} \mathcal{L} &=\log p(\underbrace{\{\hat{\mathcal{B}}_{1:M,t-T}\},\dots,\{\hat{\mathcal{B}}_{1:M,t}\},\dots,\{\hat{\mathcal{B}}_{1:M,t+T}\}}_{\text{Network Predictions from Data}},\underbrace{\{\mathcal{B}_{1:K_{t},t}\}}_{\text{Labels for frame }t}) \\ &=\underbrace{\log p(\{\hat{\mathcal{B}}_{1:M,t}\},\{\mathcal{B}_{1:K_{t},t}\})}_{\mathcal{L}_{detect}\text{ at frame }t} +\underbrace{\log p(\{\hat{\mathcal{B}}_{1:M,t+1}\},\dots,\{\hat{\mathcal{B}}_{1:M,t+T}\}|\{\mathcal{B}_{1:K_{t},t}\})}_{\text{Forward Marginal Likelihood}} \\ &\quad+\underbrace{\log p(\{\hat{\mathcal{B}}_{1:M,t-T}\},\dots,\{\hat{\mathcal{B}}_{1:M,t-1}\}|\{\mathcal{B}_{1:K_{t},t}\})}_{\text{Backward Marginal Likelihood}} \\ &=\sum_{t^{\prime}=-T,t^{\prime}\neq 0}^{T}\log p(\{\hat{\mathcal{B}}_{1:M,t+t^{\prime}}\}|\{\mathcal{B}_{1:K_{t+t^{\prime}},t+t^{\prime}}\}) +\log p(\{\mathcal{B}_{1:K_{t+t^{\prime}},t+t^{\prime}}\}|\{\mathcal{B}_{1:K_{t+t^{\prime}-1},t+t^{\prime}-1}\}) \\ &\quad+\log p(\{\hat{\mathcal{B}}_{1:M,t}\},\{\mathcal{B}_{1:K_{t},t}\}) \end{aligned}$   (28)

Eq. (28) follows from the properties of the HMM. The total objective decomposes into three terms. The first term $\log p(\{\hat{\mathcal{B}}_{1:M,t}\},\{\mathcal{B}_{1:K_{t},t}\})$ is the supervised detection loss, in the same form as the traditional R-FCN detection loss. The second term is the marginal likelihood of the forward predictions on frames $\{t+1:t+T\}$, and the third term is the marginal likelihood of the backward predictions on frames $\{t-T:t-1\}$. Both marginal likelihoods are conditioned on frame $t$'s annotation $\{\mathcal{B}_{1:K_{t},t}\}$, which acts as a weakly supervised training signal for the unlabeled neighboring frames. Since the backward likelihood factorizes with the same structure as the forward likelihood, we only derive the variational bound for the forward likelihood in the remainder of this section. Because training the marginal-likelihood terms directly is intractable, we use the following surrogate ELBO as a substitute

$\tilde{\mathcal{L}}=\sum_{t^{\prime}=1}^{T}\mathbb{E}_{r(\mathcal{B}_{t-1:t-t^{\prime}}^{1:N},a_{1:t^{\prime}-1}^{1:N};\lambda)}\Big[\log\Big(\frac{1}{N}\sum_{l=1}^{N}w_{t+t^{\prime}}^{l}\Big)\Big]$   (29)
Figure 4: Our semi-supervised learning algorithm takes the sum of the supervised loss on the labeled frame and the semi-supervised loss on consecutive frames. The semi-supervised loss is a weighted sum across all sampled particles, weighted by their importance weights. The loss term for each sampled particle consists of the detection loss on its own frame and the tracking loss from the previous frame on its sampled trajectory.

To make the surrogate ELBO loss for semi-supervised learning of our tracking and detection model easier to interpret, we relate it to the widely adopted tracking and detection losses. Taking the gradient, the surrogate loss can be rewritten as

$\begin{aligned} \nabla\tilde{\mathcal{L}} &=\sum_{t^{\prime}=1}^{T}\mathbb{E}_{r(\mathcal{B}_{t-1:t-t^{\prime}}^{1:N},a_{1:t^{\prime}-1}^{1:N};\lambda)}\nabla\log\Big(\frac{1}{N}\sum_{l=1}^{N}w_{t+t^{\prime}}^{l}\Big) \\ &=\sum_{t^{\prime}=1}^{T}\mathbb{E}_{r(\mathcal{B}_{t-1:t-t^{\prime}}^{1:N},a_{1:t^{\prime}-1}^{1:N};\lambda)}\sum_{l=1}^{N}\hat{w}_{t+t^{\prime}}^{l}\nabla\log(w_{t+t^{\prime}}^{l}) \\ &\approx\sum_{t^{\prime}=1}^{T}\sum_{l=1}^{N}\hat{w}_{t+t^{\prime}}^{l}\nabla\log(w_{t+t^{\prime}}^{l}) \\ &\approx\sum_{t^{\prime}=t}^{t+T-1}\sum_{l=1}^{N}\hat{w}_{t^{\prime}+1}^{l}\nabla\log p(\{\mathcal{B}^{l}_{1:K_{t^{\prime}+1},t^{\prime}+1}\}|\{\mathcal{B}^{l}_{1:K_{t^{\prime}},t^{\prime}}\}^{a_{t^{\prime}}^{l}}) +\hat{w}_{t^{\prime}+1}^{l}\nabla\log p(\{\hat{\mathcal{B}}^{l}_{1:M,t^{\prime}+1}\}|\{\mathcal{B}^{l}_{1:K_{t^{\prime}+1},t^{\prime}+1}\}) -\hat{w}_{t^{\prime}+1}^{l}\nabla\log r(\mathcal{B}^{l}_{1:K_{t^{\prime}+1},t^{\prime}+1};\lambda) \\ &\approx\sum_{t^{\prime}=t}^{t+T-1}\sum_{l=1}^{N}\hat{w}_{t^{\prime}+1}^{l}\nabla\mathcal{L}_{track}^{t^{\prime}+1}(a_{t^{\prime}}^{l}\to l)+\hat{w}_{t^{\prime}+1}^{l}\nabla\mathcal{L}_{detect}^{t^{\prime}+1}(l) -\hat{w}_{t^{\prime}+1}^{l}\nabla\log r(\mathcal{B}^{l}_{1:K_{t^{\prime}+1},t^{\prime}+1};\lambda) \end{aligned}$   (30)

where $\hat{w}_{t+t^{\prime}}^{l}=w_{t+t^{\prime}}^{l}/\sum_{l}w_{t+t^{\prime}}^{l}$. The surrogate loss thus decomposes into a weighted sum of tracking and detection losses over frames $\{t+1:t+T\}$, minus the importance-weighted log-density of the sampling distribution at the sampled particles. Built on our particle filter sampling algorithm, the semi-supervised learning algorithm is easily implemented on top of the traditional tracking and detection losses: the only modification is that we include a set of sampled particles as a proxy for object annotations and take the weighted sum of their supervised losses using their importance weights. This training step is iterated throughout the training process. To avoid introducing additional variance, we omit the gradients of $\log r(\mathcal{B}_{1:K_{t^{\prime}+1},t^{\prime}+1};\lambda)$. The detailed training loss is illustrated in Fig. 4.
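
A minimal PyTorch-style sketch of the semi-supervised update of Eq. (30): per-particle tracking and detection losses are combined with normalized importance weights treated as constants, and the proposal log-density term is omitted as described above. The loss functions and particle fields are assumptions:

```python
import torch

def semi_supervised_loss(particles, track_loss_fn, detect_loss_fn):
    """Importance-weighted sum of per-particle losses on an unlabeled frame (Eq. 30).

    particles: list of dicts, one per particle l, with
        'weight'  : unnormalized importance weight w_l (scalar tensor)
        'boxes'   : sampled pseudo-label boxes for frame t'+1
        'ancestor': matched boxes from the ancestor particle at frame t'
        'outputs' : network outputs for frame t'+1
    """
    w = torch.stack([p['weight'] for p in particles])
    w_hat = (w / w.sum()).detach()      # normalized weights, no gradient through them
    loss = 0.0
    for p, wl in zip(particles, w_hat):
        loss = loss + wl * (track_loss_fn(p['ancestor'], p['boxes'])
                            + detect_loss_fn(p['outputs'], p['boxes']))
    return loss
```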

6 Experiments

To show the effectiveness of incorporating an uncertainty treatment into tracking and detection compared with traditional methods, we compare our performance with non-Bayesian baselines. We evaluate on two commonly used datasets.

ImageNet Video Object Detection Dataset (ILSVRC) [16] contains 30 classes in 3862 training and 555 validation videos. The objects have ground truth annotations of their bounding boxes and track IDs.

M2Cai16-Tool-Locations Dataset [9] extends the M2Cai16-tool dataset. It contains 15 videos, recorded at 25 fps, of cholecystectomy procedures at the University Hospital of Strasbourg in France. Among those videos, 2532 frames are labeled, under the supervision and spot-checking of a surgeon, with medical devices including Grasper, Bipolar, Hook, Scissors, Clipper, Irrigator and Specimen Bag.

For a fair comparison of the Bayesian approach with the baselines, all methods use the same training and inference network structure and share the same training configuration. We only introduce an additional loss term for the object transition's covariance matrix (diagonal elements only) in Eq. (17).

A Evaluation Metrics:

Two evaluation metrics are used to quantify prediction uncertainty. We follow the commonly used benchmark of mAP at an IoU threshold of 0.5. We also use a probabilistic measurement, Probability-based Detection Quality (PDQ) [5], to jointly quantify the bounding-box location and categorical uncertainties of the detection estimates. The PDQ score increases as the estimated distribution overlaps with both the label's maximum likelihood and its uncertainty.

B Ablation Studies

Our proposed method is compared against five different baselines to study the effectiveness of incorporating the Bayesian formulation in each individual part of the tracking and detection system. In each baseline, none or only part of the Bayesian formulation is used. We refer to the five baseline methods as: single-frame R-FCN detector (Single R-FCN), greedy R-FCN box linking (Greedy R-FCN), greedy tracking-offset R-FCN box linking (Greedy D&T), frame-wise Bayesian inference (Frame Bayesian) and Kalman-filter trajectory linking of single objects (Kalman-Link).

Single R-FCN takes greedy Non-Maximum Suppression (NMS) outputs directly from the R-FCN, while Frame Bayesian infers object states from all cluster boxes (both suppressed and non-suppressed) using frame-level priors only. Greedy R-FCN links the Single R-FCN predictions frame by frame with bipartite matching on IoU scores. Greedy D&T adds a tracking estimation part to Single R-FCN and links its predictions using tracking-offset IoU scores. Kalman-Link links objects by bipartite matching and updates box locations with a Kalman filter along the trajectories of matched objects.

Table I shows the results of our method in comparison with the above five baselines on the ILSVRC dataset. Our method outperforms all five baselines on both the mAP and PDQ metrics. Our method achieves a frame-level mAP of 72.1 and a video-level mAP of 75.3, a margin of 0.2-0.5 over the second-best method. In terms of PDQ, our method achieves 39.4 at frame level and 40.2 at video level, a margin of 0.2-0.4 over the second-best method. Our method also gains a large margin of around 8.1 PDQ over the Single R-FCN baseline. The Frame Bayesian baseline outperforms Single R-FCN by a margin of 3.7 PDQ through naive inference with a frame-level prior. This gain suggests that greedy NMS is detrimental to the discriminative power of the R-FCN, because it discards a wide spectrum of information that helps distinguish positive from negative bounding boxes and capture the uncertainty of object bounding-box and categorical states. Greedy D&T has nearly the same mAP as Single R-FCN, and is even worse by 1.6; this slight degradation may be due to incorrect matching between the convolutional feature maps and the correlation kernel. Kalman-Link reaches performance second only to our proposed method, thanks to its ability to infer location uncertainty from the states of its previous trajectories. However, its trajectories are established by greedy forward matching of objects across frames in a deterministic way. By considering object-matching uncertainty under a uniform prior on object appearance and associations, our method achieves a 1.6 PDQ gain over Kalman-Link. Moreover, because we consider the uncertainties of object linking and object states jointly, our algorithm allows object-linking uncertainty to be inferred in reverse through particle reweighting.

Table I: Comparison of our method with five different baselines. Each baseline removes one or more modules of our Bayesian framework and replaces them with a naive non-Bayesian one. Our Bayesian method outperforms the baselines in all categories by introducing an uncertainty treatment.

                 Frame          Video
                 PDQ    mAP     PDQ    mAP
Our Method       39.4   72.1    40.2   75.3
Single R-FCN     31.7   70.3    32.1   70.9
Frame Bayesian   35.4   71.2    35.9   72.1
Greedy R-FCN     31.7   70.3    32.4   72.3
Greedy D&T       31.8   68.7    32.4   72.7
Kalman-Link      37.8   71.9    39.4   74.9

C Semi-Supervised Detection Result

We apply our method to semi-supervised learning on the M2Cai16-Tool-Locations dataset. In our implementation, we take 3 additional consecutive frames adjacent to each labeled frame, in random order (forward and backward). We train with the supervised loss in the first stage of training, and add the semi-supervised loss term after the supervised loss converges. Table II shows our semi-supervised detection results in comparison with training on the supervised term only. Our semi-supervised learning algorithm achieves modest improvements over supervised learning from labeled frames alone, with more pronounced improvements on objects with low mAP.

Table II: Comparison of our semi-supervised learning algorithm with learning on labeled frames only.

                 Supervised     Semi-Supervised
                 PDQ    mAP     PDQ    mAP
Grasper          35.4   46.2    37.4   52.3
Bipolar          51.3   65.9    54.2   67.1
Hook             63.9   78.4    64.1   78.6
Scissors         50.2   66.8    54.3   69.1
Clipper          69.8   85.4    70.2   85.5
Irrigator        11.3   16.2    14.9   23.5
Specimen Bag     60.7   75.8    63.1   76.2

7 Conclusion

In this paper, we present our Bayesian model for multi-object detection and tracking in videos. Our method shows the potential of formulating a neural network model in a probabilistic way, especially for tasks that require inference under uncertainty.

References

  • [1] David S Bolme, J Ross Beveridge, Bruce A Draper, and Yui Man Lui. Visual object tracking using adaptive correlation filters. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2544–2550. IEEE, 2010.
  • [2] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Detect to track and track to detect. In Proceedings of the IEEE International Conference on Computer Vision, pages 3038–3046, 2017.
  • [3] G David Forney. The viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, 1973.
  • [4] András Frank. On kuhn’s hungarian method—a tribute from hungary. Naval Research Logistics (NRL), 52(1):2–5, 2005.
  • [5] David Hall, Feras Dayoub, John Skinner, Peter Corke, Gustavo Carneiro, and Niko Sünderhauf. Probability-based detection quality (pdq): A probabilistic approach to detection evaluation. arXiv preprint arXiv:1811.10800, 2018.
  • [6] Wei Han, Pooya Khorrami, Tom Le Paine, Prajit Ramachandran, Mohammad Babaeizadeh, Honghui Shi, Jianan Li, Shuicheng Yan, and Thomas S Huang. Seq-nms for video object detection. arXiv preprint arXiv:1602.08465, 2016.
  • [7] Ali Harakeh, Michael Smart, and Steven L Waslander. Bayesod: A bayesian approach for uncertainty estimation in deep object detectors. arXiv preprint arXiv:1903.03838, 2019.
  • [8] Zhanghexuan Ji, Mohammad Abuzar Shaikh, Dana Moukheiber, Sargur N Srihari, Yifan Peng, and Mingchen Gao. Improving joint learning of chest x-ray and radiology report by word region alignment. In International Workshop on Machine Learning in Medical Imaging, pages 110–119. Springer, 2021.
  • [9] Amy Jin, Serena Yeung, Jeffrey Jopling, Jonathan Krause, Dan Azagury, Arnold Milstein, and Li Fei-Fei. Tool detection and operative skill assessment in surgical videos using region-based convolutional neural networks. IEEE Winter Conference on Applications of Computer Vision, 2018.
  • [10] Kai Kang, Hongsheng Li, Junjie Yan, Xingyu Zeng, Bin Yang, Tong Xiao, Cong Zhang, Zhe Wang, Ruohui Wang, Xiaogang Wang, et al. T-cnn: Tubelets with convolutional neural networks for object detection from videos. IEEE Transactions on Circuits and Systems for Video Technology, 28(10):2896–2907, 2017.
  • [11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [12] Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4282–4291, 2019.
  • [13] Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8971–8980, 2018.
  • [14] Chris J Maddison, John Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Teh. Filtering variational objectives. In Advances in Neural Information Processing Systems, pages 6573–6583, 2017.
  • [15] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [16] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [17] Jack Valmadre, Luca Bertinetto, João Henriques, Andrea Vedaldi, and Philip HS Torr. End-to-end representation learning for correlation filter based tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2805–2813, 2017.
  • [18] Xinshuo Weng and Kris Kitani. A Baseline for 3D Multi-Object Tracking. arXiv:1907.03961, 2019.
  • [19] Tianzhu Zhang, Changsheng Xu, and Ming-Hsuan Yang. Multi-task correlation particle filter for robust object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4335–4343, 2017.
  • [20] Zheng Zhang, Dazhi Cheng, Xizhou Zhu, Stephen Lin, and Jifeng Dai. Integrated object detection and tracking with tracklet-conditioned detection. arXiv preprint arXiv:1811.11167, 2018.
  • [21] Zhipeng Zhang and Houwen Peng. Deeper and wider siamese networks for real-time visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4591–4600, 2019.
  • [22] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. Flow-guided feature aggregation for video object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 408–417, 2017.