¹ Johns Hopkins University, Baltimore MD 21218, USA
² Indian Institute of Science, Bangalore 560012, India
Email: {vishwanathsindagi,ryasarl1,vpatel36}@jhu.edu, {deepaksam,venky}@iisc.ac.in

Learning to Count in the Crowd from Limited Labeled Data

Vishwanath A. Sindagi¹, Rajeev Yasarla¹, Deepak Sam Babu², R. Venkatesh Babu², Vishal M. Patel¹
Abstract

Recent crowd counting approaches have achieved excellent performance. However, they are essentially based on a fully supervised paradigm and require a large number of annotated samples. Obtaining annotations is an expensive and labour-intensive process. In this work, we focus on reducing the annotation effort by learning to count in the crowd from a limited number of labeled samples while leveraging a large pool of unlabeled data. Specifically, we propose a Gaussian Process-based iterative learning mechanism that involves estimation of pseudo-ground truth for the unlabeled data, which is then used as supervision for training the network. The proposed method is shown to be effective under the reduced data (semi-supervised) settings for several datasets like ShanghaiTech, UCF-QNRF, WorldExpo, UCSD, etc. Furthermore, we demonstrate that the proposed method enables the network to learn to count from a synthetic dataset while generalizing better to real-world datasets (synthetic-to-real transfer).

Keywords:
Crowd counting, semi-supervised learning, pseudo-labeling, domain adaptation, synthetic to real transfer

1 Introduction

Due to its significance in several applications (like video surveillance [12, 50, 44], public safety monitoring [57], microscopic cell counting [15], environmental studies [23], etc.), crowd counting has attracted a lot of interest from the deep learning research community. Several convolutional neural network (CNN) based approaches have been developed that address various issues in counting like scale variations, occlusion, background clutter [17, 58, 18, 36, 42, 22, 39, 37, 3, 28, 19, 43, 33, 34, 2], etc. While these methods have achieved excellent improvements in terms of the overall error rate, they follow a fully-supervised paradigm and require a large number of labeled samples. There is a wide variety of scenes and crowded scenarios that these networks need to handle in the real world. Due to a distribution gap between the training and testing environments, these networks have limited generalization abilities, and hence procuring annotations becomes especially important. However, annotating data for crowd counting typically involves obtaining point-wise annotations at head locations, which is a labour-intensive and expensive process. Hence, it is infeasible to procure annotations for all possible scenarios. Considering this, it is crucial to reduce the annotation effort, especially for crowd counting methods that are deployed in a wide variety of scenarios.

Figure 1: Results of semi-supervised learning experiments. (a) ShanghaiTech A (b) UCF-QNRF. For both datasets, the error increases with reduction in the percentage of labeled data. By leveraging the unlabeled dataset using the proposed GP-based framework, we are able to reduce the error considerably. Note that $\mathcal{D_L}$ and $\mathcal{D_U}$ indicate the labeled and unlabeled datasets, respectively.

With the exception of a few works [6, 22, 55], reducing annotation effort while maintaining good performance is relatively less explored for the task of crowd counting. Hence, in this work, we focus on learning to count using limited labeled data while leveraging unlabeled data to improve the performance. Specifically, we propose a Gaussian Process (GP) based iterative learning framework where we augment existing networks with the capability to leverage unlabeled data, thereby improving overall performance. The proposed framework follows a pseudo-labeling approach, where we estimate the pseudo-ground truth (pseudo-GT) for the unlabeled data, which is then used to supervise the network. The network is trained iteratively on labeled and unlabeled data. In the labeled stage, the network weights are updated by minimizing the $L_2$ error between the predictions and the ground-truth (GT) for the labeled data. In addition, we save the latent space vectors of the labeled data along with the ground-truths. In the unlabeled stage, we first use a GP to jointly model the relationship between the latent space vectors of the labeled images, their corresponding ground-truths, and the unlabeled latent space vectors. Next, we estimate the pseudo-GT for the unlabeled inputs using this GP. The pseudo-GT is then used to supervise the network on the unlabeled data. Minimizing the error between the unlabeled data predictions and the pseudo-GT results in improved performance. Fig. 1 illustrates the effectiveness of the proposed GP-based framework in exploiting unlabeled data on two datasets (ShanghaiTech-A [60] and UCF-QNRF [10]) in the reduced data setting. It can be observed that the proposed method leverages unlabeled data effectively, resulting in lower error across various settings.

The proposed method is evaluated on different datasets like ShanghaiTech [60], UCF-QNRF [10], WorldExpo [58], UCSD [4], etc. in the reduced data settings. In addition to obtaining lower error as compared to the existing methods [22], the performance drop due to less data is reduced by a considerable margin. Furthermore, the proposed method is effective for learning to count from synthetic data as well. More specifically, we use a labeled synthetic crowd counting dataset (GCC [55]) and unlabeled real-world datasets (ShanghaiTech [60], UCF-QNRF [10], WorldExpo [58], UCSD [5]) in our framework, and show that it is able to generalize better to real-world datasets as compared to recent domain adaptive crowd counting approaches [55]. To summarize, the following are our contributions:

  • We propose a GP-based framework to effectively exploit unlabeled data during the training process, resulting in improved overall performance. The proposed method consists of iteratively training over labeled and unlabeled data. For the unlabeled data, we estimate the pseudo-GT using the GP modeled during labeled phase.

  • We demonstrate that the proposed framework is effective in semi-supervised and synthetic-to-real transfer settings. Through various ablation studies, we show that the proposed method is generalizable to different network architectures and various reduced data settings.

2 Related Work

Crowd Counting. Traditional approaches in crowd counting ([16, 31, 7, 9, 15, 27, 56]) typically involved feature extraction techniques and training regression algorithms. Recently, CNN-based approaches like [54, 58, 36, 1, 51, 26, 60, 42] have surpassed the traditional approaches by a large margin in terms of the overall error rate. Most of these methods focus on addressing the issue of large variations in scales. Approaches like [60, 36, 42] focus on improving the receptive field. Different from these, approaches like [28, 41, 47, 32] focus on effective ways of fusing multi-scale information from deep networks. In addition to scale variation, recent approaches have addressed other issues in crowd counting like improving the quality of predicted density maps using adversarial regularization [42, 37], use of deep negative correlation-based learning for obtaining more generalizable features, and scale-based feature aggregation [3]. Most recently, several methods have incorporated additional information into the network, such as segmentation and semantic priors [61, 53], attention [20, 45, 46], perspective [38], context information [21], multiple views [59], multi-scale features [11], and adaptive density maps [52]. In other efforts, researchers have made important contributions by creating large-scale datasets for counting like UCF-QNRF [10], GCC [55] and JHU-CROWD [48, 49]. For a more detailed discussion on these methods, the reader is referred to recent comprehensive surveys [43, 8].

Learning from limited data. Recent research in crowd counting has been largely focused on improving the counting performance in the fully-supervised paradigm. Very few works like [6, 22, 55] have attempted to minimize annotation effort. Loy et al. [6] proposed a semi-supervised regression framework that exploits underlying geometric structures of crowd patterns to assimilate the count estimation of two nearby crowd pattern points in the manifold. However, this approach is specifically designed for video-based crowd counting.

Recently, Liu et al. [22] proposed to leverage additional unlabeled data for counting by introducing a learning to rank framework. They assume that any sub-image of a crowded scene image is guaranteed to contain the same number or fewer persons than the super-image. They employ a pairwise ranking hinge loss to enforce this ranking constraint for unlabeled data in addition to the $L_2$ error to train the network. In our experiments we observed that this constraint is almost always satisfied, and it provides relatively less supervision over unlabeled data.

Babu et al.[35] focus on a different approach, where they train 99.9% of their parameters from unlabeled data using a novel unsupervised learning framework based on winner-takes-all (WTA) strategy. However, they still train the remaining set of parameters using labeled data.

Wang et al. [55] take a totally different approach to minimize annotation effort by creating a new synthetic crowd counting dataset (GCC). Additionally, they propose a Cycle-GAN based domain adaptive approach for generalizing the network trained on the synthetic dataset to real-world datasets. However, there is a large gap in terms of the style and also the crowd count between the synthetic and real-world scenarios. Domain adaptive approaches have limited abilities in handling such scenarios. In order to obtain successful adaptation, the authors in [55] manually select the samples from the synthetic dataset that are closer to the real-world scenario in terms of crowd count for training the network. This selection is possible only when one has information about the count from the real-world datasets, which violates the assumption in unsupervised domain adaptation that no labeled data is available in the target domain.

Considering the drawbacks of existing approaches, we propose a new GP-based iterative training framework to exploit unlabeled data.

3 Preliminaries

In this section, we briefly review the concepts (crowd counting, semi-supervised learning and Gaussian Process) that are used in this work.

Crowd counting. Following recent works [58, 60], we adopt the density estimation approach. That is, an input crowd image is forwarded through the network, and the network outputs a density map. This density map indicates the per-pixel count of people in the image. The count in the image is obtained by integrating over the density map. For training the network using labeled data, the ground-truth density maps are obtained by placing a 2D Gaussian at each head location $x_g$ using $D(x)=\sum_{x_g\in S}\mathcal{N}(x-x_g,\sigma)$. Here, $\sigma$ is the Gaussian kernel's scale and $S$ is the list of all locations of people.
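The density-map construction described above can be sketched as follows. This is a minimal NumPy illustration and not the authors' implementation: the function names, the (row, col) coordinate convention, the default $\sigma$ and the truncation of each Gaussian at $3\sigma$ are all assumptions made for the example.

```python
import numpy as np

def gaussian_kernel_2d(sigma, radius):
    """Normalized 2D Gaussian on a (2*radius+1) x (2*radius+1) grid (sums to 1)."""
    ax = np.arange(-radius, radius + 1, dtype=np.float64)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def density_map(head_points, height, width, sigma=4.0):
    """Stamp one normalized Gaussian per annotated head.

    head_points: iterable of (row, col) head locations (assumed convention).
    The kernel is truncated at 3*sigma (an assumption of this sketch).
    """
    radius = int(3 * sigma)
    kernel = gaussian_kernel_2d(sigma, radius)
    # Work on a padded canvas so kernels near the border can be stamped whole.
    padded = np.zeros((height + 2 * radius, width + 2 * radius))
    for row, col in head_points:
        r, c = int(round(row)), int(round(col))
        if 0 <= r < height and 0 <= c < width:
            padded[r:r + 2 * radius + 1, c:c + 2 * radius + 1] += kernel
    # Crop back to the image size; mass that falls outside the image is lost.
    return padded[radius:radius + height, radius:radius + width]
```

Summing the returned map recovers the number of annotated heads, up to the mass cropped away for heads that lie close to the image border.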

Problem formulation. We are given a labeled dataset of input-GT pairs ($\{x,y\}\in\mathcal{D_L}$) and a set of unlabeled input data samples $x\in\mathcal{D_U}$. The objective is to fit a mapping function $f(x|\phi)$ (with parameters $\phi$) that accurately estimates the target label $y$ for unobserved samples. Note that this definition applies to both the semi-supervised setting and the synthetic-to-real transfer setting. In the case of synthetic-to-real transfer, the synthetic dataset is labeled and hence can be used as the labeled dataset ($\mathcal{D_L}$). Similarly, the real-world dataset is unlabeled and can be used as the unlabeled dataset ($\mathcal{D_U}$).

In order to learn the parameters, both labeled and unlabeled datasets are exploited. Typically, loss functions such as $L_1$, $L_2$ or cross entropy error are used for labeled data. For exploiting unlabeled data $\mathcal{D_U}$, existing approaches augment $f(x|\phi)$ with information like the shape of the data manifold [25] via different techniques such as enforcing consistency regularization [13], virtual adversarial training [24] or pseudo-labeling [14]. In this work, we employ a pseudo-labeling based approach where we estimate pseudo-GT for unlabeled data, and then use it to supervise the network with traditional supervised loss functions.

Gaussian process. A Gaussian process (GP) $f(v)$ is an infinite collection of random variables, any finite subset of which has a joint Gaussian distribution. A GP is fully specified by its mean function $m(v)$ and covariance function $K(v,v')$. These are defined below:

$$m(v)=\mathbb{E}[f(v)], \tag{1}$$
$$K(v,v')=\mathbb{E}\left[(f(v)-m(v))\left(f(v')-m(v')\right)\right], \tag{2}$$

where $v,v'\in\mathcal{V}$ denote the possible inputs that index the GP. The covariance matrix is computed from the covariance function $K$, which expresses the notion of smoothness of the underlying function. The GP can then be formulated as follows:

$$f(v)\sim\mathcal{GP}(m(v),K(v,v')+\sigma_{\epsilon}^{2}I), \tag{3}$$

where $I$ is the identity matrix and $\sigma_{\epsilon}^{2}$ is the variance of the additive noise. Any collection of function values is then jointly Gaussian as follows:

$$f(V)=\left[f(v_1),\ldots,f(v_n)\right]^{T}\sim\mathcal{N}\left(\mu,K(V,V')+\sigma_{\epsilon}^{2}I\right), \tag{4}$$

with mean vector and covariance matrix defined by the GP as mentioned earlier. To make predictions at unlabeled points, one can compute a Gaussian posterior distribution in closed form by conditioning on the observed data. For more details, we refer the reader to [29].

4 GP-based iterative learning

Fig. 2 gives an overview of the proposed method. The network is constructed using an encoder $f_e(x,\phi_e)$ and a decoder $f_d(z,\phi_d)$, which are parameterized by $\phi_e$ and $\phi_d$, respectively. The proposed framework is agnostic to the encoder network, and we show in the experiments section that it generalizes well to architectures such as VGG16 [40], ResNet-50 and ResNet-101 [30]. The decoder consists of a set of 2 conv-relu layers (see supplementary material for more details). Typically, an input crowd image $x$ is forwarded through the encoder network to obtain the corresponding latent space vector $z$. This vector is then forwarded through the decoder network to obtain the crowd density output $y$, i.e., $y=f_d(f_e(x,\phi_e),\phi_d)$.

We are given a training dataset, $\mathcal{D}=\mathcal{D_L}\cup\mathcal{D_U}$, where $\mathcal{D_L}=\{x_l^i,y_l^i\}_{i=1}^{N_l}$ is a labeled dataset containing $N_l$ training samples and $\mathcal{D_U}=\{x_u^i\}_{i=1}^{N_u}$ is an unlabeled dataset containing $N_u$ training samples. The proposed framework effectively leverages both datasets by iterating the training process over the labeled dataset $\mathcal{D_L}$ and the unlabeled dataset $\mathcal{D_U}$. More specifically, the training process consists of two stages: (i) Labeled training stage: In this stage, we employ a supervised loss function $\mathcal{L}_s$ to learn the network parameters using the labeled dataset, and (ii) Unlabeled training stage: We generate pseudo-GTs for the unlabeled data points using the GP formulation, which are then used for supervising the network on the unlabeled dataset. In what follows, we describe these stages in detail.

Figure 2: Illustration of the proposed framework. Training is performed iteratively over labeled and unlabeled data. For labeled data, we minimize the $L_2$ error between the predictions and GT. For unlabeled data, we minimize the $L_2$ error between the predictions and pseudo-GT.

4.1 Labeled stage

Since the labeled dataset $\mathcal{D_L}$ comes with annotations, we employ the $L_2$ error between the predictions and the GTs as the supervision loss for training the network. This loss objective is defined as follows:

$$\mathcal{L}_s=\mathcal{L}_2=\|y_l^{pred}-y_l\|_2, \tag{5}$$

where $y_l^{pred}=f_d(z_l,\phi_d)$ is the predicted output, $y_l$ is the ground-truth, and $z_l=f_e(x_l,\phi_e)$ is the intermediate latent space vector. Note that the subscript $l$ in the above quantities indicates that these are defined for labeled data.

Along with performing supervision on the labeled data, we additionally save the feature vectors $z_l^i$ from the intermediate latent space in a matrix $F_{z_l}$. Specifically, $F_{z_l}=\{z_l^i\}_{i=1}^{N_l}$. This matrix is used for computing the pseudo-GTs for unlabeled data at a later stage. The dimension of the $F_{z_l}$ matrix is $N_l\times M$. Here, $M$ is the dimension of the latent space vector $z_l$. In our case, the latent space vector dimension is $64\times 32\times 32$ (see supplementary material for more details), which is reshaped to $1\times 65{,}536$. Hence, $M=65{,}536$.
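A small sketch of the two operations of the labeled stage, the $L_2$ loss of Eq. (5) and the flattening of latent tensors into the rows of $F_{z_l}$, is given below. It uses plain NumPy on already-computed arrays; the function names are hypothetical and the sketch does not cover the network or the gradient update.

```python
import numpy as np

def l2_loss(y_pred, y_true):
    """L2 norm of the difference between a predicted and a target density map (Eq. 5)."""
    diff = np.asarray(y_pred, dtype=np.float64) - np.asarray(y_true, dtype=np.float64)
    return float(np.linalg.norm(diff.ravel()))

def build_feature_matrix(latents):
    """Flatten latent tensors of shape (N_l, C, H, W) into a matrix of shape (N_l, M).

    With C x H x W = 64 x 32 x 32 this gives M = 65,536 columns per labeled sample.
    """
    latents = np.asarray(latents, dtype=np.float64)
    return latents.reshape(latents.shape[0], -1)
```

Each row of the returned matrix corresponds to one labeled sample, so the ground-truth density maps should be stored in the same order to keep rows and targets aligned.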

4.2 Unlabeled stage

Since the unlabeled data $\mathcal{D_U}$ does not come with any GT annotations, we estimate pseudo-GTs, which are then used as supervision for training the network on unlabeled data. For this purpose, we use a GP to jointly model the relationship between the latent space vectors of the labeled images $F_{z_l}$, the corresponding GT $T_{y_l}$, and the unlabeled latent space vectors $z_u^{pred}$.

Estimation of pseudo-GT: As discussed earlier, the training process iterates over the labeled data $\mathcal{D_L}$ and the unlabeled data $\mathcal{D_U}$. After the labeled stage, the labeled latent space vectors $F_{z_l}$ and their corresponding GT density maps $T_{y_l}$ are used to model the function $t$ that maps the latent vectors to the output density maps, $y=t(z)$. Using GP, we model this function $t(\cdot)$ as an infinite collection of functions of which any finite subset is jointly Gaussian. More specifically, we jointly model the distribution of the function values $t(\cdot)$ of the latent space vectors of the labeled and the unlabeled samples using GP as follows:

$$P(t(z)|\mathcal{D_L},F_{z_l},T_{y_l})\sim\mathcal{GP}(\mu,K(F_{z_l},F_{z_l})+\sigma_{\epsilon}^{2}I), \tag{6}$$

where $\mu$ is the function value computed using GP, $\sigma_{\epsilon}^{2}$ is set equal to 1, and $K$ is the kernel function. Based on this, the conditional joint distribution for the latent space vector $z_u^k$ of the $k$-th unlabeled sample $x_u^k$ can be expressed as the following Gaussian distribution:

$$P(t(z_u^k)|\mathcal{D_L},F_{z_l},T_{y_l})=\mathcal{N}(\mu_u^k,\Sigma_u^k), \tag{7}$$

where

$$\mu_u^k=K(z_u^k,F_{z_l})\left[K(F_{z_l},F_{z_l})+\sigma_{\epsilon}^{2}I\right]^{-1}T_{y_l}, \tag{8}$$
$$\Sigma_u^k=K(z_u^k,z_u^k)-K(z_u^k,F_{z_l})\left[K(F_{z_l},F_{z_l})+\sigma_{\epsilon}^{2}I\right]^{-1}K(F_{z_l},z_u^k)+\sigma_{\epsilon}^{2}, \tag{9}$$

where $\sigma_{\epsilon}^{2}$ is set equal to 1 and $K$ is a kernel function with the following definition:

$$K(Z,Z)_{k,i}=\kappa(z_u^k,z_l^i)=\frac{\langle z_u^k,z_l^i\rangle}{|z_u^k|\cdot|z_l^i|}. \tag{10}$$

Considering the large dimensionality of the latent space vector, $K(F_{z_l},F_{z_l})$ can grow quickly in size, especially if the number of labeled data samples $N_l$ is high. In such cases, the computational and memory requirements become prohibitively high. Additionally, not all latent vectors are necessarily useful, since these vectors correspond to different regions of images in terms of content and size/density of the crowd. In order to overcome these issues, we use only those labeled vectors that are similar to the unlabeled latent vector. Specifically, we consider only the $N_n$ nearest labeled vectors corresponding to an unlabeled vector. That is, we replace $F_{z_l}$ by $F_{z_l,n}$ in Eq. (7)-(9). Here $F_{z_l,n}=\{z_l^j: z_l^j\in nearest(z_u^k,F_{z_l},N_n)\}$ and $T_{y_l,n}=\{y_l^j: z_l^j\in nearest(z_u^k,F_{z_l},N_n)\}$, with $nearest(p,Q,N_n)$ being a function that finds the top $N_n$ nearest neighbors of $p$ in $Q$.
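The neighbor selection can be sketched as below. The text does not state which distance the $nearest(\cdot)$ function uses, so this example assumes cosine similarity (the same measure as the kernel in Eq. (10)); the function name and return layout are likewise hypothetical.

```python
import numpy as np

def nearest_labeled(z_u, F_zl, T_yl, n_neighbors):
    """Select the n_neighbors labeled latent vectors most similar to z_u.

    Similarity is cosine similarity (an assumption of this sketch).
    Returns the selected rows of F_zl, the matching rows of T_yl, and their indices.
    """
    z_u = np.asarray(z_u, dtype=np.float64)
    F_zl = np.asarray(F_zl, dtype=np.float64)
    T_yl = np.asarray(T_yl, dtype=np.float64)
    z_unit = z_u / (np.linalg.norm(z_u) + 1e-12)
    F_unit = F_zl / (np.linalg.norm(F_zl, axis=1, keepdims=True) + 1e-12)
    similarity = F_unit @ z_unit                 # shape (N_l,)
    idx = np.argsort(-similarity)[:n_neighbors]  # most similar first
    return F_zl[idx], T_yl[idx], idx
```

With this selection the linear system in the posterior involves an $N_n\times N_n$ matrix instead of an $N_l\times N_l$ one, which is the source of the savings in computation and memory described above.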

The pseudo-GT for an unlabeled data sample is given by the mean predicted in Eq. (8), i.e., $y_{u,pseudo}^k=\mu_u^k$. The $L_2$ distance between the prediction $y_{u,pred}^k=f_d(z_u^k,\phi_d)$ and the pseudo-GT $y_{u,pseudo}^k$ is used as supervision for updating the parameters of the encoder $f_e(\cdot,\phi_e)$ and the decoder $f_d(\cdot,\phi_d)$.

Furthermore, the pseudo-GT estimated using Eq. (8) may not necessarily be perfect. Errors in the pseudo-GT will limit the performance of the network. To overcome this, we explicitly exploit the variance modeled by the GP. Specifically, we minimize the predictive variance by considering Eq. (9) in the loss function. As discussed earlier, using all the latent space vectors of the labeled data may not necessarily be effective. Hence, we minimize the variance $\Sigma_{u,n}^k$ computed between $z_u^k$ and the $N_n$ nearest neighbors in the latent space vectors using GP. Thus, the loss function during the unlabeled stage is defined as:

$$\mathcal{L}_{un}=\frac{1}{|\Sigma_{u,n}^k|}\|y_{u,pred}^k-y_{u,pseudo}^k\|_2+\log\Sigma_{u,n}^k, \tag{11}$$

where $y_{u,pred}^k$ is the crowd density map prediction obtained by forwarding an unlabeled input image $x_u^k$ through the network, $y_{u,pseudo}^k=\mu_u^k$ is the pseudo-GT (see Eq. (8)), and $\Sigma_{u,n}^k$ is the predictive variance obtained by replacing $F_{z_l}$ in Eq. (9) with $F_{z_l,n}$.
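As an illustration of Eq. (11), the sketch below evaluates the variance-weighted loss for one unlabeled sample, treating the predictive variance as a positive scalar. The function name is hypothetical, and the sketch only computes the value of the loss; it does not perform back-propagation.

```python
import numpy as np

def unlabeled_loss(y_pred, y_pseudo, variance):
    """Variance-weighted L2 loss of Eq. (11) for a single unlabeled sample.

    variance: scalar predictive variance from Eq. (9); must be positive so
    that the logarithm is defined.
    """
    if variance <= 0:
        raise ValueError("predictive variance must be positive")
    diff = np.asarray(y_pred, dtype=np.float64) - np.asarray(y_pseudo, dtype=np.float64)
    residual = np.linalg.norm(diff.ravel())
    return float(residual / abs(variance) + np.log(variance))
```

The first term down-weights the residual when the GP is uncertain about the pseudo-GT, while the logarithmic term penalizes large variances so that the loss cannot be reduced simply by inflating the variance.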

4.3 Final objective function

We combine the supervised loss Eq. (5) and the unsupervised loss Eq. (11) to obtain the final objective function as follows:

$$\mathcal{L}_f=\mathcal{L}_s+\lambda_{un}\mathcal{L}_{un}, \tag{12}$$

where $\lambda_{un}$ is a hyper-parameter that weighs the unsupervised loss.

5 Experiments and results

In this section, we discuss the details of the various experiments conducted to demonstrate the effectiveness of the proposed method. Since the proposed method is able to leverage unlabeled data to improve the overall performance, we performed evaluation in two settings: (i) Semi-supervised settings: In this setting, we varied the percentage of labeled samples from 5% to 75%. We first show that with the base network, there is a performance drop due to the reduced data. We then show that the proposed method is able to recover a major percentage of this performance drop. (ii) Synthetic-to-real transfer settings: In this setting, the goal is to train on a synthetic dataset (labeled), while adapting to a real-world dataset. Unlabeled images from the real world are available during training. In both settings, the proposed method is able to achieve better results as compared to recent methods. Details of the datasets are provided in the supplementary material.

5.1 Semi-supervised settings

In this section, we conduct experiments in the semi-supervised settings by reducing the amount of labeled data available during training. The rest of the samples in the dataset are considered as unlabeled samples wherever applicable. In the following sub-sections, we present comparison of the proposed method in the 5% setting with other recent methods. For comparison, we used 4 datasets: ShanghaiTech (SH-A/B)[60], UCF-QNRF [10], WorldExpo [58] and UCSD [4]. This is followed by a detailed ablation study involving different architectures and various percentages of labeled data used during training. For ablation, we chose ShanghaiTech-A and UCF-QNRF datasets since they contain a wide diversity of scenes and large variation in count and scales.

Implementation details. We train the network using the Adam optimizer with a learning rate of 10e-5 and a momentum of 0.9 on an NVIDIA Titan Xp GPU. We use a batch size of 24. During training, random crops of size $256\times 256$ are used. During inference, the entire image is forwarded through the network. For evaluation, we use the mean absolute error (MAE) and mean squared error (MSE) metrics, which are defined as $MAE=\frac{1}{N}\sum_{i=1}^{N}|y_i-y'_i|$ and $MSE=\sqrt{\frac{1}{N}\sum_{i=1}^{N}|y_i-y'_i|^{2}}$, respectively. Here, $N$ is the total number of test images, $y_i$ is the ground-truth/target count of people in the $i$-th image and $y'_i$ is the predicted count of people in the $i$-th image. We set aside 10% of the training set for the purpose of validation. The hyper-parameter $\lambda_{un}$ was chosen based on the validation performance. More details are provided in the supplementary material.
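The two evaluation metrics can be written directly from the definitions above. The snippet is a small NumPy sketch with hypothetical function names; note that the quantity called MSE here includes a square root, as in the definition in the text.

```python
import numpy as np

def mae(gt_counts, pred_counts):
    """Mean absolute error between ground-truth and predicted per-image counts."""
    gt = np.asarray(gt_counts, dtype=np.float64)
    pred = np.asarray(pred_counts, dtype=np.float64)
    return float(np.mean(np.abs(gt - pred)))

def mse(gt_counts, pred_counts):
    """Square root of the mean squared count error (the MSE defined in the text)."""
    gt = np.asarray(gt_counts, dtype=np.float64)
    pred = np.asarray(pred_counts, dtype=np.float64)
    return float(np.sqrt(np.mean((gt - pred) ** 2)))
```

Each predicted count would be obtained by summing the predicted density map of the corresponding test image.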

Table 1: Comparison of results in SSL settings. Reducing labeled data to 5% results in a large performance drop as compared to 100% data. ResNet-50 was used as the encoder network for all the methods. RL: Ranking-Loss. GP: Gaussian-Process. AG: Average Gain (%), defined in Sec. 5.1.

| Method | $\mathcal{D_L}$ | $\mathcal{D_U}$ | SH-A MAE | SH-A MSE | SH-A AG | SH-B MAE | SH-B MSE | SH-B AG | UCF-QNRF MAE | UCF-QNRF MSE | UCF-QNRF AG | WExpo MAE | WExpo AG | UCSD MAE | UCSD MSE | UCSD AG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 (Oracle) | 100% | - | 76 | 126 | - | 8.4 | 14.5 | - | 114 | 195 | - | 10.1 | - | 1.7 | 2.1 | - |
| ResNet-50 ($\mathcal{D_L}$-only) | 5% | - | 118 | 211 | - | 21.2 | 34.2 | - | 186 | 295 | - | 14.2 | - | 2.2 | 2.8 | - |
| ResNet-50+RL | 5% | 95% | 115 | 208 | 2.0 | 20.1 | 32.9 | 4.0 | 182 | 291 | 1.7 | 14.0 | 0.01 | 2.2 | 2.8 | 0 |
| ResNet-50+GP (Ours) | 5% | 95% | 102 | 172 | 16 | 15.7 | 27.9 | 22 | 160 | 275 | 10 | 12.8 | 10 | 2.0 | 2.4 | 12 |

Comparison with recent approaches. Here, we compare the effectiveness of the proposed method with a recent method by Liu et al.[22] on 4 different datasets. In order to get a better understanding of the overall improvements, we also provide the results of the base network with (i) 100% labeled data supervision that is the oracle performance, and (ii) 5% labeled data supervision.

For all the methods (except oracle), we limited the labeled data used during training to 5% of the training dataset. The rest of the samples were used as unlabeled samples. We used ResNet-50 as the encoder network. The results of the experiments are shown in Table 1. For all the experiments that we conducted, we report the average of the results for 5 trials. The standard deviations are reported in the supplementary material. We make the following observations for all the datasets: (i) Compared to using the entire dataset, reducing the labeled data during training (to 5%) leads to a significant increase in error. (ii) The proposed GP-based framework is able to reduce the performance drop by a large margin. Further, the proposed method achieves an average gain (AG)¹ of anywhere between 10% and 22% over the $\mathcal{D_L}$-only baseline across all datasets. (iii) The proposed method is able to leverage the unlabeled data more effectively as compared to Liu et al. [22]. This is because the authors in [22] use a ranking loss on the unlabeled data, which is based on the assumption that a sub-image of a crowded scene is guaranteed to contain the same or fewer number of people compared to the entire image. We observed that this constraint is satisfied naturally for most of the unlabeled images, and hence it provides less supervision (see supplementary material for a detailed analysis).

¹ $AG=\frac{G_{mae}+G_{mse}}{2}$, $G_{mae}=\frac{mae_{(\mathcal{D_U+D_L})}-mae_{(\mathcal{D_L})}}{mae_{(\mathcal{D_L})}}$, $G_{mse}=\frac{mse_{(\mathcal{D_U+D_L})}-mse_{(\mathcal{D_L})}}{mse_{(\mathcal{D_L})}}$.
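As a worked check of the average gain, the sketch below recomputes AG from the MAE and MSE values in Table 1. One assumption is made explicit: the gain is computed as a relative error reduction, (baseline minus semi-supervised) divided by baseline, so that an improvement yields a positive number as in the AG columns of the tables. The function name is hypothetical.

```python
def average_gain(mae_base, mae_ssl, mse_base=None, mse_ssl=None):
    """Average relative error reduction, in percent, over the labeled-only baseline.

    Sign convention (an assumption of this sketch): positive values mean the
    semi-supervised model has lower error than the baseline.
    When no MSE is available (as for WExpo), only the MAE gain is used.
    """
    g_mae = (mae_base - mae_ssl) / mae_base
    if mse_base is None or mse_ssl is None:
        return 100.0 * g_mae
    g_mse = (mse_base - mse_ssl) / mse_base
    return 100.0 * (g_mae + g_mse) / 2.0

# Values taken from Table 1 (5% labeled data, ResNet-50 encoder).
sh_a_gain = average_gain(118, 102, 211, 172)      # approx. 16.0
sh_b_gain = average_gain(21.2, 15.7, 34.2, 27.9)  # approx. 22.2
wexpo_gain = average_gain(14.2, 12.8)             # approx. 9.9
```

Rounded to the precision used in the table, these reproduce the reported gains of 16, 22 and 10 for SH-A, SH-B and WExpo, respectively.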

Table 2: Results of ablation study with different percentages of labeled data. The proposed method achieves significant gains across different percentages of labeled data. We used ResNet-50 as the encoder network for all the experiments. AG: Average Gain (%), defined in Sec. 5.1.

| $\mathcal{D_L}$ % | SH-A No-GP ($\mathcal{D_L}$-only) MAE | SH-A No-GP MSE | SH-A GP ($\mathcal{D_L+D_U}$) MAE | SH-A GP MSE | SH-A AG % | UCF-QNRF No-GP ($\mathcal{D_L}$-only) MAE | UCF-QNRF No-GP MSE | UCF-QNRF GP ($\mathcal{D_L+D_U}$) MAE | UCF-QNRF GP MSE | UCF-QNRF AG % |
|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 118 | 211 | 102 | 172 | 16 | 186 | 295 | 160 | 275 | 10 |
| 25 | 110 | 160 | 91 | 149 | 12 | 178 | 252 | 147 | 226 | 14 |
| 50 | 102 | 149 | 89 | 148 | 6.1 | 158 | 250 | 136 | 218 | 13 |
| 75 | 93 | 146 | 88 | 139 | 4.7 | 139 | 240 | 129 | 210 | 9.8 |
| 100 | 76 | 126 | - | - | - | 114 | 195 | - | - | - |

Ablation of labeled data percentage. We conducted an ablation study where we varied the percentage of labeled data used during the training process. More specifically, we used 4 different settings: 5%, 25%, 50% and 75%. The remaining data were used as unlabeled samples. We used ResNet-50 as the network encoder for all the settings. This ablation study was conducted on 2 datasets: ShanghaiTech-A (SH-A) and UCF-QNRF. The results of this ablation study are shown in Table 2. It can be observed for both datasets that as the percentage of labeled data is reduced, the performance of the baseline network drops significantly. However, the proposed GP-based framework is able to leverage unlabeled data in all the cases to reduce this performance drop by a considerable margin. Figs. 3 and 4 show sample qualitative results on the ShanghaiTech-A and UCF-QNRF datasets for the semi-supervised protocol with the 5% labeled data setting. It can be observed that the proposed method is able to predict the density maps more accurately as compared to the baseline method that does not consider unlabeled data.

Figure 3: Results of SSL experiments on the ShanghaiTech-A [60] dataset using the 5% labeled data setting. (a) Input. (b) No-GP. (c) Proposed method. (d) Ground-truth.

Figure 4: Results of SSL experiments on the UCF-QNRF [10] dataset using the 5% labeled data setting. (a) Input. (b) No-GP. (c) Proposed method. (d) Ground-truth.

Table 3: Results of ablation study with different networks. The proposed method is able to exploit unlabeled data irrespective of the architecture. We used 5% of the training data as the labeled set, and the rest as unlabeled samples. AG: Average Gain (%), defined in Sec. 5.1.

| Net | $\mathcal{D_L}$ % | SH-A No-GP ($\mathcal{D_L}$-only) MAE | SH-A No-GP MSE | SH-A GP ($\mathcal{D_L+D_U}$) MAE | SH-A GP MSE | SH-A AG % | UCF-QNRF No-GP ($\mathcal{D_L}$-only) MAE | UCF-QNRF No-GP MSE | UCF-QNRF GP ($\mathcal{D_L+D_U}$) MAE | UCF-QNRF GP MSE | UCF-QNRF AG % |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 100 | 76 | 126 | - | - | - | 114 | 195 | - | - | - |
| ResNet-50 | 5 | 118 | 211 | 102 | 172 | 16 | 186 | 295 | 160 | 275 | 10 |
| ResNet-101 | 100 | 76 | 117 | - | - | - | 116 | 197 | - | - | - |
| ResNet-101 | 5 | 131 | 200 | 110 | 162 | 18 | 196 | 324 | 174 | 288 | 11 |
| VGG16 | 100 | 74 | 118 | - | - | - | 120 | 197 | - | - | - |
| VGG16 | 5 | 121 | 205 | 112 | 163 | 14 | 188 | 316 | 175 | 291 | 7.4 |

Architecture ablation. We conducted an ablation study where we evaluated the proposed method using different architectures. More specifically, we used ResNet-50, ResNet-101 and VGG16 as the encoder network. The ablation was performed on 2 datasets: ShanghaiTech-A (SH-A) and UCF-QNRF. For all the experiments, we used 5% of the training dataset as the labeled dataset, and the rest were used as unlabeled samples. The results of this experiment are shown in Table 3. Based on these results, we make the following observations: (i) Since networks like VGG16 and ResNet-101 have a higher number of parameters, they tend to overfit more in the reduced-data setting as compared to ResNet-50. (ii) The proposed GP-based method obtains consistent gains by leveraging the unlabeled dataset across different architectures.

Figure 5: Histogram of pseudo-GT errors ($err_{pseudo}^{u}$) and prediction errors ($err_{pred}^{u}$) on unlabeled data during training. Note that the pseudo-GT errors are concentrated at the lower end, implying that the pseudo-GTs are closer to the ground truth than the predictions. Hence, pseudo-GTs provide meaningful supervision.

Pseudo-GT Analysis. To gain a deeper understanding of the effectiveness of the proposed approach, we plot histograms of the normalized errors of the network predictions $y_{pred}^{u}$ and of the pseudo-GT $y_{pseudo}^{u}$ for the unlabeled data during the training process. Specifically, we plot histograms of $err_{pred}^{u}$ and $err_{pseudo}^{u}$, where $err_{pred}^{u}=\frac{|y_{pred}^{u}-y_{gt}^{u}|}{y_{gt}^{u}}$ and $err_{pseudo}^{u}=\frac{|y_{pseudo}^{u}-y_{gt}^{u}|}{y_{gt}^{u}}$. Here, $y_{gt}^{u}$ is the actual GT corresponding to the unlabeled data sample. The plot is shown in Fig. 5. It can be observed that the pseudo-GT errors are concentrated at the lower end of the error range compared to the prediction errors. This implies that the pseudo-GTs are closer to the GTs than the predictions are. Hence, the pseudo-GTs obtained using the proposed method provide good-quality supervision on the unlabeled data.
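
The normalized errors above can be computed and binned as in the following minimal sketch. It assumes that each $y$ is a scalar per-image count; the sample values, the bin width and the function names are hypothetical and are not taken from the paper.

```python
def normalized_error(estimate, gt):
    """Return |estimate - gt| / gt for one unlabeled sample."""
    if gt <= 0:
        raise ValueError("ground-truth count must be positive")
    return abs(estimate - gt) / gt


def histogram(values, bin_width=0.1, num_bins=10):
    """Count values per bin; values past the last edge go into the final bin."""
    counts = [0] * num_bins
    for v in values:
        idx = min(int(v / bin_width), num_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical counts, for illustration only (not results from the paper).
y_gt = [120.0, 300.0, 80.0, 510.0]
y_pred = [150.0, 240.0, 100.0, 400.0]     # network predictions
y_pseudo = [130.0, 285.0, 84.0, 480.0]    # GP-based pseudo-GT

err_pred = [normalized_error(p, g) for p, g in zip(y_pred, y_gt)]
err_pseudo = [normalized_error(p, g) for p, g in zip(y_pseudo, y_gt)]

print("prediction errors:", histogram(err_pred))
print("pseudo-GT errors: ", histogram(err_pseudo))
```

With these made-up inputs the pseudo-GT errors all fall into the lowest bin, which is the qualitative pattern described for Fig. 5.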

5.2 Synthetic-to-Real transfer setting

Recently, Wang et al. [55] proposed a synthetic crowd counting dataset (GCC) that consists of 15,212 images with a total of 7,625,843 annotations. The primary purpose of this dataset is to reduce the annotation effort by training networks on synthetic data, thereby eliminating the need for labeling. However, due to the gap between the synthetic and real-world data distributions, networks trained on the synthetic dataset perform poorly on real-world images. To overcome this issue, the authors in [55] proposed a Cycle-GAN based domain adaptive approach that additionally enforces SSIM consistency. More specifically, they first learn to translate synthetic crowd images to real-world images using an SSIM-based Cycle-GAN. This transfers the style of the synthetic images toward that of real-world images. The translated synthetic images are then used to train a counting network (SFCN) that is based on the ResNet-101 architecture.

While this approach improves the error over the baseline methods, its performance is limited when the distribution gap between real and synthetic images is large. Moreover, the authors in [55] manually select the synthetic samples used for training the network. This selection ensures that only samples that are close to the real-world images in terms of count are used for training. Such a selection is not feasible in the case of unsupervised domain adaptation, where we have no access to labels in the target dataset.

Table 4: Comparison of results in the synthetic-to-real transfer setting. We train the network on the synthetic crowd counting dataset (GCC) and leverage the training set of the real-world datasets without any labels. We used the same network as described in [55].

| Method | SH-A MAE | SH-A MSE | SH-B MAE | SH-B MSE | UCF-QNRF MAE | UCF-QNRF MSE | UCF-CC-50 MAE | UCF-CC-50 MSE | WExpo MAE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| No Adapt | 160 | 217 | 22.8 | 30.6 | 276 | 459 | 487 | 689 | 42.8 |
| Cycle GAN [62] | 143 | 204 | 24.4 | 39.7 | 257 | 401 | 405 | 548 | 32.4 |
| SE Cycle GAN [55] | 123 | 193 | 19.9 | 28.3 | 230 | 384 | 373 | 529 | 26.3 |
| Proposed Method | 121 | 181 | 12.8 | 19.2 | 210 | 351 | 355 | 505 | 20.4 |

The proposed GP-based framework overcomes these drawbacks and can be extended to the synthetic-to-real transfer setting as well. We treat the synthetic data as the labeled training set and the real-world training set as the unlabeled dataset, and train the network to leverage the unlabeled dataset. The results of this experiment are reported in Table 4. We used the same network (SFCN) and training process as described in [55]. As can be observed, the proposed method improves on the recent approach of [55] on every dataset; for example, the MAE drops from 19.9 to 12.8 on SH-B and from 26.3 to 20.4 on WExpo. Since we estimate the pseudo-GT for unlabeled real-world images and use it directly as supervision, the distribution gap that the network needs to handle is much smaller. This results in better performance compared to the domain adaptive approach [55]. Unlike [55], we train the network on the unlabeled data and hence do not need to perform any synthetic sample selection. Figs. 6 and 7 show sample qualitative results on the ShanghaiTech-A and UCF-QNRF datasets for the synthetic-to-real transfer protocol. The proposed method predicts the density maps more accurately than the baseline.
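
To make the size of these improvements concrete, the following sketch computes the relative MAE reduction of the proposed method over SE Cycle GAN [55] from the values in Table 4. The helper name `relative_reduction` is ours, and the percentages are derived quantities, not numbers reported in the paper.

```python
# MAE values copied from Table 4 (synthetic-to-real transfer).
mae_se_cyclegan = {"SH-A": 123, "SH-B": 19.9, "UCF-QNRF": 230,
                   "UCF-CC-50": 373, "WExpo": 26.3}
mae_proposed = {"SH-A": 121, "SH-B": 12.8, "UCF-QNRF": 210,
                "UCF-CC-50": 355, "WExpo": 20.4}


def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


reductions = {
    name: relative_reduction(mae_se_cyclegan[name], mae_proposed[name])
    for name in mae_se_cyclegan
}

for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower MAE than SE Cycle GAN")
```

The reduction is positive on all five datasets and is largest on SH-B and WExpo.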

Figure 6: Results of Synthetic-to-Real transfer experiments on ShanghaiTech-A dataset. (a): Input. (b) No Adapt (c) Proposed Method (d) Ground-truth.

Figure 7: Results of Synthetic-to-Real transfer experiments on the UCF-QNRF [10] dataset. (a): Input. (b) No Adapt. (c) Proposed Method. (d) Ground-truth.

6 Conclusions

In this work, we focused on learning to count in the crowd from limited labeled data. Specifically, we proposed a GP-based iterative learning framework that involves estimation of pseudo-GT for unlabeled data using Gaussian Processes, which is then used as supervision for training the network. Through various experiments, we show that the proposed method can be effectively used in a variety of scenarios that involve unlabeled data like learning with less data or synthetic to real-world transfer. In addition, we conducted detailed ablation studies to demonstrate that the proposed method generalizes well to different network architectures and is able to achieve consistent gains for different amounts of labeled data.

Acknowledgement

This work was supported by the NSF grant 1910141.

Supplementary Material

Due to limited space in the main paper, we present additional details about the proposed method and experiments in the supplementary.

Encoder and Decoder Architecture

Here, we provide details of the encoder and decoder architecture for all the experiments.

Encoder: In the main paper, we conducted experiments with 4 different networks for the encoder. For the semi-supervised experiments, we used Res50, Res101 and VGG16. For learning from synthetic data, we used Res101-SFCN [55]. The details are as follows:

(i) Res50: First 3 layers of Res50 are used as the encoder.

(ii) Res101: First 3 layers of Res101 are used as the encoder.

(iii) VGG16: First 10 layers of VGG16 are used as the encoder.

(iv) Res101-SFCN: We use the network exactly as described in [55]. In this network, the layers up to the final dilated conv layer are considered part of the encoder.

For all the above networks, the features of the final encoder layer are forwarded through a $1\times 1$ conv layer to reduce the dimensionality to 64 channels. The output of this $1\times 1$ conv is the feature embedding in the latent space, which is used in GP modeling. Since the training crop size is $256\times 256$, the intermediate feature map in the latent space has dimension $64\times 32\times 32$.

Decoder: We use the same decoder in all the semi-supervised learning experiments. The decoder consists of 2 conv-relu layers. The first is a $3\times 3$ conv layer that takes in 64 channels and outputs 64 channels. The final layer is a $1\times 1$ conv layer that takes in 64 channels and outputs 1 channel, which is the density map. The final conv layer is followed by a bilinear upsampling layer that upsamples the output density map to the resolution of the input image.
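
The tensor shapes implied by this description can be traced with a short, framework-free sketch. The crop size, channel counts and kernel sizes come from the text; the overall encoder downsampling factor is inferred from $256 \rightarrow 32$, and the padding of 1 for the $3\times 3$ conv is an assumption made so that the spatial size is preserved.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


CROP = 256                 # training crop size (from the text)
LATENT = 32                # latent spatial size (from the text)
stride = CROP // LATENT    # inferred overall encoder downsampling factor

# A 1x1 conv reduces the encoder output to 64 channels, same spatial size.
latent_hw = conv_out(LATENT, kernel=1)
latent_shape = (64, latent_hw, latent_hw)

# Decoder: 3x3 conv (64 -> 64, padding=1 assumed), then 1x1 conv (64 -> 1).
h = conv_out(latent_hw, kernel=3, padding=1)
h = conv_out(h, kernel=1)

# Bilinear upsampling by the same factor restores the input resolution.
density_shape = (1, h * stride, h * stride)

print("latent embedding:", latent_shape)
print("density map:     ", density_shape)
```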

In case of learning from the synthetic data, since we use the same network as in [55], all the layers after the dilated conv layers are used as decoder.

Dataset Details

In this section, we provide details of the different datasets used for evaluating the proposed method in the main paper.

ShanghaiTech [60]: This dataset contains 1198 annotated images with a total of 330,165 people. It consists of two parts: Part A with 482 images and Part B with 716 images. Both parts are further divided into training and test sets, with the training set of Part A containing 300 images and that of Part B containing 400 images. The remaining images are used as the test set.

UCF-QNRF [10]: UCF-QNRF is a large crowd counting dataset with 1535 high-resolution images and 1.25 million head annotations. There are 1201 training images and 334 test images. It contains extremely congested scenes where the maximum count of an image can reach 12865.

WorldExpo [58]: The WorldExpo’10 dataset was introduced by Zhang et al. [58] and contains 3,980 annotated frames from 1,132 video sequences captured by 108 surveillance cameras. The frames are divided into training and test sets. The training set contains 3,380 frames and the test set contains 600 frames from five different scenes, with 120 frames per scene. A Region of Interest (ROI) map is also provided for each of the five scenes.

UCSD [4]: The UCSD crowd counting dataset consists of 2000 frames from a single scene. The frames contain relatively sparse crowds, with the number of people ranging from 11 to 46 per frame. A region of interest (ROI) is provided for the scene in the dataset. Of the 2000 frames, frames 601 through 1400 are used for training, while the remaining frames are held out for testing.

GCC [55]: The GTA V Crowd Counting Dataset (GCC) is a large-scale synthetic dataset based on an electronic game, and consists of 15,212 crowd images. GCC provides three evaluation strategies (random splitting, cross-camera, and cross-location evaluation).
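
As a quick consistency check on the splits quoted above, the following sketch derives the test-set sizes from the stated totals and training-set sizes. All numbers are copied from the dataset descriptions; the dictionary layout is only illustrative.

```python
# Totals and training-set sizes as stated in the dataset descriptions.
splits = {
    "ShanghaiTech-A": {"total": 482, "train": 300},
    "ShanghaiTech-B": {"total": 716, "train": 400},
    "UCF-QNRF": {"total": 1535, "train": 1201},
    "WorldExpo'10": {"total": 3980, "train": 3380},
    # UCSD trains on frames 601 through 1400 (inclusive).
    "UCSD": {"total": 2000, "train": 1400 - 601 + 1},
}

test_sizes = {name: s["total"] - s["train"] for name, s in splits.items()}

for name, n_test in test_sizes.items():
    print(f"{name}: {splits[name]['train']} train / {n_test} test")
```

The derived sizes agree with the text: 334 test images for UCF-QNRF, and 600 test frames for WorldExpo'10 (five scenes with 120 frames each).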

Table 5: Effect of $\lambda_{un}$ on the ShanghaiTech Part-A validation set.

| $\lambda_{un}$ | MAE | MSE |
| --- | --- | --- |
| 0.0 | 102 | 175 |
| 0.2 | 100 | 162 |
| 0.4 | 89 | 149 |
| 0.6 | 85 | 140 |
| 0.8 | 88 | 147 |
| 1.0 | 92 | 156 |

Hyper-parameter $\lambda_{un}$

In this section, we study the effect of $\lambda_{un}$ on the overall performance. $\lambda_{un}$ weighs the unsupervised loss function in Eq. 12 of the main paper. For this study, we use the ShanghaiTech-A dataset, due to its wide variety of scenes and diversity in count. We conducted this experiment in the 5% data setting, where 5% of the data was used as labeled data and the rest as unlabeled data. We used the Res50 encoder. Note that we perform the evaluation on the held-out validation set (and not on the test set). The results for different values of $\lambda_{un}$ are shown in Table 5.

We observed that the performance peaks when $\lambda_{un}=0.6$. $\lambda_{un}=0$ corresponds to training on labeled data only; this is the baseline performance. As we increase $\lambda_{un}$, the error decreases. However, for $\lambda_{un}>0.6$, the error rises slightly. This is because the network has not yet learned well in the initial stages of training. The pseudo-GT is therefore erroneous at that stage, and a high weight on the unsupervised loss early in training prevents the network from reaching optimal performance.

Based on this experiment, we use $\lambda_{un}=0.6$ for all the experiments.
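
The selection can be reproduced from Table 5 with a short sketch. The validation numbers are copied from the table; the additive form of `total_loss` is an assumption about Eq. 12 of the main paper (which is not reproduced here), based only on the statement that $\lambda_{un}$ weighs the unsupervised term.

```python
def total_loss(supervised, unsupervised, lam_un):
    """Assumed form of the objective: L = L_s + lambda_un * L_un."""
    return supervised + lam_un * unsupervised


# (MAE, MSE) on the ShanghaiTech Part-A validation set, from Table 5.
table5 = {
    0.0: (102, 175),
    0.2: (100, 162),
    0.4: (89, 149),
    0.6: (85, 140),
    0.8: (88, 147),
    1.0: (92, 156),
}

# Pick the weight with the lowest validation MAE.
best_lambda = min(table5, key=lambda lam: table5[lam][0])
print("selected lambda_un:", best_lambda)
```

Selecting by MSE instead of MAE gives the same value for these numbers.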

Table 6: Semi-supervised experiments with recent crowd counting methods. We used 5% of the training data as the labeled set and the rest as unlabeled samples. No-GP: trained on $\mathcal{D_L}$ only. GP: trained on $\mathcal{D_L}+\mathcal{D_U}$. AG: Average Gain (%).

| Net | $\mathcal{D_L}$ % | SH-A No-GP MAE | SH-A No-GP MSE | SH-A GP MAE | SH-A GP MSE | SH-A AG % | UCF-QNRF No-GP MAE | UCF-QNRF No-GP MSE | UCF-QNRF GP MAE | UCF-QNRF GP MSE | UCF-QNRF AG % |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Res101-SFCN | 100 | 74 | 114 | - | - | - | 113 | 196 | - | - | - |
| Res101-SFCN | 5 | 128 | 199 | 109 | 160 | 17 | 193 | 323 | 172 | 282 | 12 |
| CSRNet | 100 | 71 | 112 | - | - | - | 123 | 195 | - | - | - |
| CSRNet | 5 | 120 | 200 | 111 | 159 | 14 | 187 | 310 | 171 | 293 | 7.0 |

Table 7: Results in SSL settings. Reducing labeled data to 5% results in performance drop by a big margin as compared to 100% data. Res50 was used as the encoder network for all the methods. RL: Ranking-Loss. GP: Gaussian-Process. AG: Average Gain (%).

| Method | $\mathcal{D_L}$ | $\mathcal{D_U}$ | SH-A MAE | SH-A MSE | SH-A AG | SH-B MAE | SH-B MSE | SH-B AG | UCF-QNRF MAE | UCF-QNRF MSE | UCF-QNRF AG | WExpo MAE | WExpo AG | UCSD MAE | UCSD MSE | UCSD AG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ours | 5% | 95% | 102 ± 0.8 | 172 ± 2.1 | 16 | 15.7 ± 0.9 | 27.9 ± 1.1 | 22 | 160 ± 2.4 | 275 ± 3.1 | 10 | 12.8 ± 0.5 | 10 | 2.0 ± 0.05 | 2.4 ± 0.09 | 12 |

Table 8: Results for synthetic-to-real transfer settings. We train the network on synthetic crowd counting dataset (GCC), and leverage the training set of real-world datasets without any labels. We used the same network and training/evaluation protocol as in [55].

| Method | SH-A MAE | SH-A MSE | SH-B MAE | SH-B MSE | UCF-QNRF MAE | UCF-QNRF MSE | UCF-CC-50 MAE | UCF-CC-50 MSE | WExpo MAE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ours | 121 ± 0.6 | 181 ± 1.6 | 12.8 ± 0.3 | 19.2 ± 0.9 | 210 ± 2.7 | 351 ± 4.1 | 355 ± 4.4 | 505 ± 5.9 | 20.4 ± 0.9 |

Additional Architecture Ablation

In this section, we report additional architecture ablation experiments using two recent crowd counting techniques: CSRNet [19] and Res101-SFCN [55]. We use the 5% data setting, where 5% of the data is used as labeled and the rest as unlabeled. We evaluated both methods on the ShanghaiTech-A (SH-A) and UCF-QNRF datasets. For CSRNet, we use the layers up to the last dilated conv as the encoder. For the decoder, we use 2 conv layers as described earlier.

The results of this experiment are shown in Table 6. In addition to MAE/MSE, we report the Average Gain (AG), defined as $AG=\frac{G_{mae}+G_{mse}}{2}$, where $G_{mae}=\frac{mae_{(\mathcal{D_{U}+D_{L}})}-mae_{(\mathcal{D_{L}})}}{mae_{(\mathcal{D_{L}})}}$ and $G_{mse}=\frac{mse_{(\mathcal{D_{U}+D_{L}})}-mse_{(\mathcal{D_{L}})}}{mse_{(\mathcal{D_{L}})}}$. The tables list AG as the magnitude of this relative change, in percent. We observed consistent gains in both cases when we used the proposed GP-based method to leverage unlabeled data.
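
As a worked check of the AG values, the sketch below recomputes them from the 5% rows of Table 6. The tabulated AG is positive when the GP model has the lower error, so the sketch takes each gain as the relative reduction in error (baseline minus GP); this sign convention and the function name `average_gain` are our assumptions.

```python
def average_gain(mae_l, mse_l, mae_lu, mse_lu):
    """Average relative error reduction (in %) from adding unlabeled data.

    mae_l, mse_l   : errors when training on labeled data only (No-GP)
    mae_lu, mse_lu : errors when training on labeled + unlabeled data (GP)
    """
    g_mae = (mae_l - mae_lu) / mae_l
    g_mse = (mse_l - mse_lu) / mse_l
    return 100.0 * (g_mae + g_mse) / 2.0


# 5% labeled-data rows of Table 6.
ag_sfcn_sha = average_gain(128, 199, 109, 160)     # Res101-SFCN, SH-A
ag_csrnet_ucf = average_gain(187, 310, 171, 293)   # CSRNet, UCF-QNRF

print(f"Res101-SFCN on SH-A: AG = {ag_sfcn_sha:.1f}%")
print(f"CSRNet on UCF-QNRF:  AG = {ag_csrnet_ucf:.1f}%")
```

After rounding, both values match the AG entries listed in Table 6 (17 and 7.0).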

Multiple Trials

In this section, we report the standard-deviations for the experiments with our proposed method corresponding to Table 1 and Table 4 in the main paper. See Table 7 and Table 8. Note that the standard deviations are computed using 5 trials.
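
A "mean ± standard deviation" entry can be produced from repeated runs as in the sketch below. The five trial values are hypothetical placeholders, not results from the paper, and the text does not say whether the sample or the population standard deviation is used, so the sample form is an assumption here.

```python
import statistics

# Hypothetical MAE values from 5 independent trials (illustration only).
trial_mae = [101.2, 102.9, 101.8, 102.5, 101.6]

mean_mae = statistics.mean(trial_mae)
# Sample standard deviation (n - 1 in the denominator); use
# statistics.pstdev for the population form instead.
std_mae = statistics.stdev(trial_mae)

print(f"MAE over {len(trial_mae)} trials: {mean_mae:.1f} ± {std_mae:.1f}")
```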

References

  • [1] Arteta, C., Lempitsky, V., Zisserman, A.: Counting in the wild. In: European Conference on Computer Vision. pp. 483–498. Springer (2016)
  • [2] Babu Sam, D., Sajjan, N.N., Venkatesh Babu, R., Srinivasan, M.: Divide and grow: Capturing huge diversity in crowd images with incrementally growing cnn. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3618–3626 (2018)
  • [3] Cao, X., Wang, Z., Zhao, Y., Su, F.: Scale aggregation network for accurate and efficient crowd counting. In: European Conference on Computer Vision. pp. 757–773. Springer (2018)
  • [4] Chan, A.B., Liang, Z.S.J., Vasconcelos, N.: Privacy preserving crowd monitoring: Counting people without people models or tracking. In: Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. pp. 1–7. IEEE (2008)
  • [5] Chan, A.B., Vasconcelos, N.: Bayesian poisson regression for crowd counting. In: 2009 IEEE 12th International Conference on Computer Vision. pp. 545–551. IEEE (2009)
  • [6] Change Loy, C., Gong, S., Xiang, T.: From semi-supervised to transfer counting of crowds. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2256–2263 (2013)
  • [7] Chen, K., Loy, C.C., Gong, S., Xiang, T.: Feature mining for localised crowd counting. In: European Conference on Computer Vision (2012)
  • [8] Gao, G., Gao, J., Liu, Q., Wang, Q., Wang, Y.: Cnn-based density estimation and crowd counting: A survey. arXiv preprint arXiv:2003.12783 (2020)
  • [9] Idrees, H., Saleemi, I., Seibert, C., Shah, M.: Multi-source multi-scale counting in extremely dense crowd images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2547–2554 (2013)
  • [10] Idrees, H., Tayyab, M., Athrey, K., Zhang, D., Al-Maadeed, S., Rajpoot, N., Shah, M.: Composition loss for counting, density map estimation and localization in dense crowds. In: European Conference on Computer Vision. pp. 544–559. Springer (2018)
  • [11] Jiang, X., Xiao, Z., Zhang, B., Zhen, X., Cao, X., Doermann, D., Shao, L.: Crowd counting and density estimation by trellis encoder-decoder network. arXiv preprint arXiv:1903.00853 (2019)
  • [12] Kang, D., Ma, Z., Chan, A.B.: Beyond counting: Comparisons of density maps for crowd analysis tasks-counting, detection, and tracking. arXiv preprint arXiv:1705.10118 (2017)
  • [13] Laine, S., Aila, T.: Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242 (2016)
  • [14] Lee, D.H.: Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks
  • [15] Lempitsky, V., Zisserman, A.: Learning to count objects in images. In: Advances in Neural Information Processing Systems. pp. 1324–1332 (2010)
  • [16] Li, M., Zhang, Z., Huang, K., Tan, T.: Estimating the number of people in crowded scenes by mid based foreground segmentation and head-shoulder detection. In: Pattern Recognition, 2008. ICPR 2008. 19th International Conference on. pp. 1–4. IEEE (2008)
  • [17] Li, T., Chang, H., Wang, M., Ni, B., Hong, R., Yan, S.: Crowded scene analysis: A survey. IEEE Transactions on Circuits and Systems for Video Technology 25(3), 367–386 (2015)
  • [18] Li, W., Mahadevan, V., Vasconcelos, N.: Anomaly detection and localization in crowded scenes. IEEE transactions on pattern analysis and machine intelligence 36(1), 18–32 (2014)
  • [19] Li, Y., Zhang, X., Chen, D.: Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1091–1100 (2018)
  • [20] Liu, N., Long, Y., Zou, C., Niu, Q., Pan, L., Wu, H.: Adcrowdnet: An attention-injective deformable convolutional network for crowd understanding. arXiv preprint arXiv:1811.11968 (2018)
  • [21] Liu, W., Salzmann, M., Fua, P.: Context-aware crowd counting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5099–5108 (2019)
  • [22] Liu, X., van de Weijer, J., Bagdanov, A.D.: Leveraging unlabeled data for crowd counting by learning to rank. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
  • [23] Lu, H., Cao, Z., Xiao, Y., Zhuang, B., Shen, C.: Tasselnet: Counting maize tassels in the wild via local counts regression network. Plant Methods 13(1), 79 (2017)
  • [24] Miyato, T., Maeda, S.i., Koyama, M., Ishii, S.: Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence 41(8), 1979–1993 (2018)
  • [25] Oliver, A., Odena, A., Raffel, C.A., Cubuk, E.D., Goodfellow, I.: Realistic evaluation of deep semi-supervised learning algorithms. In: Advances in Neural Information Processing Systems. pp. 3235–3246 (2018)
  • [26] Onoro-Rubio, D., López-Sastre, R.J.: Towards perspective-free object counting with deep learning. In: European Conference on Computer Vision. pp. 615–629. Springer (2016)
  • [27] Pham, V.Q., Kozakaya, T., Yamaguchi, O., Okada, R.: Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3253–3261 (2015)
  • [28] Ranjan, V., Le, H., Hoai, M.: Iterative crowd counting. In: European Conference on Computer Vision. pp. 278–293. Springer (2018)
  • [29] Rasmussen, C.E.: Gaussian processes in machine learning. In: Summer School on Machine Learning. pp. 63–71. Springer (2003)
  • [30] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems. pp. 91–99 (2015)
  • [31] Ryan, D., Denman, S., Fookes, C., Sridharan, S.: Crowd counting using multiple local features. In: Digital Image Computing: Techniques and Applications, 2009. DICTA’09. pp. 81–88. IEEE (2009)
  • [32] Sam, D.B., Babu, R.V.: Top-down feedback for crowd counting convolutional neural network. In: Thirty-second AAAI conference on artificial intelligence (2018)
  • [33] Sam, D.B., Peri, S.V., Kamath, A., Babu, R.V., et al.: Locate, size and count: Accurately resolving people in dense crowds via detection. arXiv preprint arXiv:1906.07538 (2019)
  • [34] Sam, D.B., Peri, S.V., Mukuntha, N., Babu, R.V.: Going beyond the regression paradigm with accurate dot prediction for dense crowds. In: 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 2853–2861. IEEE (2020)
  • [35] Sam, D.B., Sajjan, N.N., Maurya, H., Babu, R.V.: Almost unsupervised learning for dense crowd counting. In: Thirty-Third AAAI Conference on Artificial Intelligence (2019)
  • [36] Sam, D.B., Surya, S., Babu, R.V.: Switching convolutional neural network for crowd counting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
  • [37] Shen, Z., Xu, Y., Ni, B., Wang, M., Hu, J., Yang, X.: Crowd counting via adversarial cross-scale consistency pursuit. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
  • [38] Shi, M., Yang, Z., Xu, C., Chen, Q.: Revisiting perspective information for efficient crowd counting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7279–7288 (2019)
  • [39] Shi, Z., Zhang, L., Liu, Y., Cao, X., Ye, Y., Cheng, M.M., Zheng, G.: Crowd counting with deep negative correlation learning. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
  • [40] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)
  • [41] Sindagi, V.A., Patel, V.M.: Cnn-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In: Advanced Video and Signal Based Surveillance (AVSS), 2017 IEEE International Conference on. IEEE (2017)
  • [42] Sindagi, V.A., Patel, V.M.: Generating high-quality crowd density maps using contextual pyramid cnns. In: The IEEE International Conference on Computer Vision (ICCV) (Oct 2017)
  • [43] Sindagi, V.A., Patel, V.M.: A survey of recent advances in cnn-based single image crowd counting and density estimation. Pattern Recognition Letters (2017)
  • [44] Sindagi, V.A., Patel, V.M.: Dafe-fd: Density aware feature enrichment for face detection. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 2185–2195. IEEE (2019)
  • [45] Sindagi, V.A., Patel, V.M.: Ha-ccn: Hierarchical attention-based crowd counting network. arXiv preprint arXiv:1907.10255 (2019)
  • [46] Sindagi, V.A., Patel, V.M.: Inverse attention guided deep crowd counting network. arXiv preprint (2019)
  • [47] Sindagi, V.A., Patel, V.M.: Multi-level bottom-top and top-bottom feature fusion for crowd counting. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1002–1012 (2019)
  • [48] Sindagi, V.A., Yasarla, R., Patel, V.M.: Pushing the frontiers of unconstrained crowd counting: New dataset and benchmark method. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1221–1231 (2019)
  • [49] Sindagi, V.A., Yasarla, R., Patel, V.M.: Jhu-crowd++: Large-scale crowd counting dataset and a benchmark method. arXiv preprint arXiv:2004.03597 (2020)
  • [50] Toropov, E., Gui, L., Zhang, S., Kottur, S., Moura, J.M.: Traffic flow from a low frame rate city camera. In: Image Processing (ICIP), 2015 IEEE International Conference on. pp. 3802–3806. IEEE (2015)
  • [51] Walach, E., Wolf, L.: Learning to count with cnn boosting. In: European Conference on Computer Vision. pp. 660–676. Springer (2016)
  • [52] Wan, J., Chan, A.: Adaptive density map generation for crowd counting. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1130–1139 (2019)
  • [53] Wan, J., Luo, W., Wu, B., Chan, A.B., Liu, W.: Residual regression with semantic prior for crowd counting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4036–4045 (2019)
  • [54] Wang, C., Zhang, H., Yang, L., Liu, S., Cao, X.: Deep people counting in extremely dense crowds. In: Proceedings of the 23rd ACM international conference on Multimedia. pp. 1299–1302. ACM (2015)
  • [55] Wang, Q., Gao, J., Lin, W., Yuan, Y.: Learning from synthetic data for crowd counting in the wild. arXiv preprint arXiv:1903.03303 (2019)
  • [56] Xu, B., Qiu, G.: Crowd density estimation based on rich features and random projection forest. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 1–8. IEEE (2016)
  • [57] Zhan, B., Monekosso, D.N., Remagnino, P., Velastin, S.A., Xu, L.Q.: Crowd analysis: a survey. Machine Vision and Applications 19(5-6), 345–357 (2008)
  • [58] Zhang, C., Li, H., Wang, X., Yang, X.: Cross-scene crowd counting via deep convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 833–841 (2015)
  • [59] Zhang, Q., Chan, A.B.: Wide-area crowd counting via ground-plane density maps and multi-view fusion cnns. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8297–8306 (2019)
  • [60] Zhang, Y., Zhou, D., Chen, S., Gao, S., Ma, Y.: Single-image crowd counting via multi-column convolutional neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 589–597 (2016)
  • [61] Zhao, M., Zhang, J., Zhang, C., Zhang, W.: Leveraging heterogeneous auxiliary tasks to assist crowd counting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 12736–12745 (2019)
  • [62] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision. pp. 2223–2232 (2017)