
A Strong Baseline for Crowd Counting and Unsupervised People Localization

Liangzi Rong, Chunping Li (Contact Author)
1School of Software, Tsinghua University, Beijing, China
[email protected], [email protected]
Abstract

In this paper, we explore a strong baseline for crowd counting and an unsupervised people localization algorithm based on estimated density maps. Existing methods achieve state-of-the-art performance with different backbones and various training tricks. We collect these backbones and tricks, evaluate the impact of each, and develop an efficient pipeline for crowd counting that decreases MAE and RMSE significantly on multiple datasets. We also propose a clustering algorithm named isolated KMeans to locate heads in density maps. It divides a density map into subregions and finds the centers under local count constraints, requires no trainable parameters, and can be easily integrated with existing methods.

1 Introduction

Crowd counting is a computer vision task that aims to output a density map indicating the distribution of the crowd in a still image and to obtain the estimated people count by integrating over the map. In recent years, many novel networks have been designed, and the MAE on the ShanghaiTech PartA dataset has been improved from 110.2 (MCNN) to 61.7 (SPN). It is important to note, however, that this improvement is not only the result of newly proposed methods; it also benefits from various training tricks used in implementation. According to He et al. (2019), training details including data preprocessing, changes to the loss function, ground truth generation, and the setting of learning objectives make a big difference.

Figure 1: Overview of our strong baseline and results. (a) Overview of our strong baseline for crowd counting and people localization. (b) Comparison of the MAE and RMSE curves on the ShanghaiTech PartA dataset.

Since Li et al. (2018), most researchers have extended their work using VGG-16 as the backbone, and the main components of these fully convolutional networks are the front-end and back-end shown in Figure 1(a). Wang and Breckon (2019) have shown that training tricks can make a big difference. Besides, some methods use other classification networks as the backbone. To make a fair comparison and construct a strong baseline that is easy to follow for both academia and industry, we collect and evaluate a set of backbones and tricks for regressing a more accurate density map. As Figure 1(b) shows, by combining these tricks, the MAE can be decreased from 74.9 to 60.1 and the RMSE from 126.1 to 95.5, which is better than SPN. The tricks show consistent improvement on multiple datasets, which proves the effectiveness of this strong baseline.

Another drawback of existing methods is the lack of position information for each person. Since position information is essential for understanding crowd images, we extend our work to find the position of each person. Liu et al. (2019a) and Liu et al. (2019d) have combined detection networks to locate heads. However, crowd counting datasets have no standard bounding-box labels as detection datasets do, so the ground truth is not reliable. Besides, the parameters of detection networks have to be retrained for different datasets, which tends to be time consuming in practical use. In fact, the estimated density maps already indicate the positions by showing clusters around each head center, and a human can find most of the heads easily by observing these clusters. Based on this observation, we propose an unsupervised localization method named isolated KMeans based on clustering. Because the people count $n$ can be obtained by calculating the integral, the objective is simplified into finding $n$ centers indicating the positions of $n$ persons. Each point whose value is greater than 0 in the density map is regarded as a potential head, and cluster centers converge to points near the heads. In addition, isolated KMeans locates the head centers under local count constraints so that the number of cluster centers matches the number of people both globally and locally. This localization method can be integrated with any existing crowd counting method without training numerous parameters. To the best of our knowledge, we are the first to use an unsupervised method for people localization in this field.

To summarize, our main contribution is three-fold:

  • We collect a series of backbones and training tricks for crowd counting and evaluate their impact on model performance to validate their effectiveness and enable a fair comparison.

  • We develop a strong baseline by combining the effective tricks, which improves counting accuracy and density map quality significantly on three widely used datasets.

  • We propose an unsupervised method for people localization in density maps. It relies on a clustering algorithm named isolated KMeans, which divides a density map into subregions and locates the heads under local people count constraints. This method frees us from training the large number of parameters of detection networks and can be integrated with any existing counting method.

2 The Setup of the Standard Baseline

All datasets only provide discrete point labels at the centers of heads, which are hard to learn from. We first convert the labeled points into density maps with continuous distributions.

Ground Truth Density Map Generation. Following most previous works, each labeled point at a head’s center is substituted with a Gaussian distribution, and superimposing multiple Gaussian distributions produces the ground truth density map. This is formulated as

M_i = \sum_{j=1}^{n} \delta(x - head_j) * G_{\sigma_j}(x)  (1)

where $M_i$ denotes the density map, $head_j$ is the coordinate of the $j$-th head annotation, $G_{\sigma}(x)$ is the Gaussian kernel, and $n$ is the number of head annotations. In the ShanghaiTech PartB dataset, the standard deviation $\sigma_j$ is set to 15. In other datasets, $\sigma_j$ is calculated as $\sigma_j = \beta d_j$, where $d_j$ is the average distance from $x_j$ to its $k$ nearest neighbors. In this paper we adopt the configuration $\beta = 0.3$ and $k = 3$.
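For concreteness, the sketch below implements Eq. (1) with NumPy/SciPy. The function name and interface are ours, the Gaussian is applied via scipy.ndimage.gaussian_filter, and we assume each image contains more than $k$ annotations; the paper does not specify kernel truncation or border handling, so those details follow the library defaults.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial import KDTree

def generate_density_map(points, height, width, beta=0.3, k=3, fixed_sigma=None):
    """points: list of (x, y) head coordinates; returns an H x W density map (Eq. 1)."""
    density = np.zeros((height, width), dtype=np.float32)
    if len(points) == 0:
        return density
    tree = KDTree(points)
    for (x, y) in points:
        delta = np.zeros_like(density)
        delta[min(int(y), height - 1), min(int(x), width - 1)] = 1.0
        if fixed_sigma is not None:                  # e.g. sigma = 15 on ShanghaiTech PartB
            sigma = fixed_sigma
        else:                                        # geometry-adaptive: sigma_j = beta * d_j
            dists, _ = tree.query((x, y), k=k + 1)   # the nearest neighbour is the point itself
            sigma = beta * float(np.mean(dists[1:]))
        density += gaussian_filter(delta, sigma)     # delta(x - head_j) * G_sigma_j(x)
    return density
```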

Training Settings and Evaluation Metrics. Experiments are conducted on the ShanghaiTech PartA dataset for comparison. We use PyTorch to implement all training processes. For all models, the total number of training epochs is set to 200 and the initial learning rate to 1e-5. The Adam optimizer and the ReduceLROnPlateau learning rate scheduler are used to optimize the parameters. We adopt the two most commonly used metrics, MAE and RMSE, to evaluate the accuracy of counting and the quality of the density maps.

Backbones. Most existing methods adopt VGG-16 as the backbone. Some literature also adopts ResNet-50 Wang et al. (2019) and InceptionV3 Wang and Breckon (2019). Similarly to Li et al. (2018), we remove the fully connected layers and preserve the convolutional layers to compare the performance of these backbones and to explore whether batch normalization should be used. For the back-end, we use the configuration $3\times CS(512,3,2) - Upsample - CS(256,3,2) - Upsample - CS(128,3,2) - Upsample - CS(64,3,2) - Upsample - C(1,1,1)$. $N\times CS(m,s,d)$ means $N$ convolutional layers integrated with an SE block Hu et al. (2018), each with $m$ filters of size $s\times s$ and dilation rate $d$, followed by the Swish activation of Ramachandran et al. (2017). We use bilinear interpolation to upsample the feature maps.
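To make the back-end configuration concrete, the following PyTorch sketch shows one $CS(m,s,d)$ unit. The ordering of the convolution, the SE gating, and the Swish activation, as well as the SE reduction ratio, are our assumptions; the class name is ours and not from the paper's code.

```python
import torch
import torch.nn as nn

class CSBlock(nn.Module):
    """One CS(m, s, d) unit: dilated conv + SE block (Hu et al., 2018) + Swish."""
    def __init__(self, in_ch, m, s=3, d=2, reduction=16):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, m, kernel_size=s, padding=d * (s // 2), dilation=d)
        # Squeeze-and-Excitation: global pooling, bottleneck, sigmoid channel gating
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(m, m // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(m // reduction, m, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        x = self.conv(x)
        x = x * self.se(x)            # channel re-weighting
        return x * torch.sigmoid(x)   # Swish activation (Ramachandran et al., 2017)
```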

Backbones MAE RMSE
first 10 layers of VGG-16 74.8 126.1
first 13 layers of VGG-16 74.9 118.1
first 13 layers of VGG-16-bn 92.1 134.6
InceptionV3 119.4 170.5
ResNet-50 80.6 130.1
Table 1: Performance of different backbones.

For models without batch normalization, we train on the original images of different sizes. For models with batch normalization, we train on $256\times 256$ patches randomly cropped from the images. To be specific, we randomly choose 4 images at a time and randomly crop 4 patches from each of them, so one batch consists of 16 patches of $256\times 256$. In our experiments, if the back-end contains batch normalization layers, the model converges to outputting all zeros, so batch normalization layers are used only in the front-end. The performance of different backbones is reported in Table 1. Due to its superior performance, we select the first 13 layers of VGG-16 as the backbone for the following experiments.

3 Loss Functions

Many loss functions have been proposed to improve the quality of density maps. Among them, the MSE loss ($\ell_{MSE}$) is the most widely used for its simplicity and effectiveness. It is defined as:

\ell_{MSE} = \frac{1}{N}\sum_{i=1}^{N} \|\hat{M_i} - M_i\|_2  (2)

The Spatial Abstraction Loss ($SAL$) is proposed by Jiang et al. (2019); it progressively computes MSE losses at multiple abstraction levels. The computation is formalized as:

SAL = \sum_{k_a=1}^{K_a} \frac{1}{N_{k_a}} \|\hat{M_i^{k_a}} - M_i^{k_a}\|_2  (3)

where $k_a$ indicates the abstraction level and $K_a$ is set to 3. $\hat{M_i^{k_a}}$ and $M_i^{k_a}$ are the density maps downsampled by pooling layers, and $N_{k_a}$ is the number of pixels within a map.

The multi-scale density level consistency loss ($\ell_c$) is proposed by Dai et al. (2019). It separates the density maps into subregions using adaptive average pooling and calculates an L1 loss to enforce consistency at different scale levels. It is defined as:

\ell_c = \sum_{s=1}^{S} \frac{1}{k_s^2} \|\hat{M_i^s} - M_i^s\|_1  (4)

where $s$ is the scale level and $k_s$ is the size of the pooled density map. They are set to $S=3$ and $1\times 1, 2\times 2, 4\times 4$ respectively.
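A short PyTorch sketch of Eq. (4) is given below; it reflects our reading of Dai et al. (2019), and the function name is ours.

```python
import torch
import torch.nn.functional as F

def consistency_loss(pred, gt, sizes=(1, 2, 4)):
    """pred, gt: (B, 1, H, W) density maps; Eq. (4) with S = 3 scale levels."""
    loss = 0.0
    for k in sizes:
        p = F.adaptive_avg_pool2d(pred, k)   # pool the maps to k x k subregions
        g = F.adaptive_avg_pool2d(gt, k)
        loss = loss + torch.abs(p - g).sum() / (k * k)
    return loss
```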

SSIM was proposed by Wang et al. (2004) to measure the similarity of images, and the Local Pattern Consistency Loss ($\ell_{SSIM}$) is introduced by Cao et al. (2018) to enhance structural consistency. A fixed-parameter kernel $W(p)$ is adopted to define the weights of different positions within one sliding window. For the same location $x$ in density maps $\hat{M_i}$ and $M_i$, the local statistics are defined as:

\mu_{\hat{M_i}}(x) = \sum_{p\in\mathcal{P}} W(p)\cdot\hat{M_i}(x+p)  (5)
\sigma_{\hat{M_i}}^2(x) = \sum_{p\in\mathcal{P}} W(p)\cdot\left[\hat{M_i}(x+p) - \mu_{\hat{M_i}}(x)\right]^2  (6)
\sigma_{\hat{M_i}\,M_i}(x) = \sum_{p\in\mathcal{P}} W(p)\cdot\left[\hat{M_i}(x+p) - \mu_{\hat{M_i}}(x)\right]\cdot\left[M_i(x+p) - \mu_{M_i}(x)\right]  (7)

and the SSIM index and SSIM loss are calculated as:

SSIM = \frac{\left(2\mu_{\hat{M_i}}\mu_{M_i} + C_1\right)\left(2\sigma_{\hat{M_i}\,M_i} + C_2\right)}{\left(\mu_{\hat{M_i}}^2 + \mu_{M_i}^2 + C_1\right)\left(\sigma_{\hat{M_i}}^2 + \sigma_{M_i}^2 + C_2\right)}  (8)
\ell_{SSIM} = 1 - SSIM  (9)
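The sketch below implements Eqs. (5)-(9) in PyTorch. The $11\times 11$ Gaussian window with $\sigma=1.5$ and the constants $C_1$, $C_2$ are common SSIM defaults rather than values stated in the paper, and the function names are ours.

```python
import torch
import torch.nn.functional as F

def gaussian_window(size=11, sigma=1.5):
    """Normalized 2-D Gaussian kernel W(p) for the sliding window."""
    coords = torch.arange(size).float() - size // 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = (g / g.sum()).unsqueeze(0)
    return (g.t() @ g).view(1, 1, size, size)

def ssim_loss(pred, gt, c1=1e-4, c2=9e-4):
    """pred, gt: (B, 1, H, W); returns 1 - mean local SSIM (Eq. 9)."""
    w = gaussian_window().to(pred.device)
    pad = w.shape[-1] // 2
    mu_p, mu_g = F.conv2d(pred, w, padding=pad), F.conv2d(gt, w, padding=pad)
    var_p = F.conv2d(pred * pred, w, padding=pad) - mu_p ** 2        # Eq. (6)
    var_g = F.conv2d(gt * gt, w, padding=pad) - mu_g ** 2
    cov = F.conv2d(pred * gt, w, padding=pad) - mu_p * mu_g          # Eq. (7)
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / \
           ((mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2))     # Eq. (8)
    return 1.0 - ssim.mean()
```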

We train the model using different loss functions, and the results are reported in Table 2.

Loss Functions MAE RMSE
$\ell_{MSE}$ 74.9 118.1
$\ell_{MSE}$ + $SAL$ (Max Pooling) 149.0 225.4
$\ell_{MSE}$ + $SAL$ (Avg Pooling) 128.4 200.9
$\ell_{MSE} + \ell_c$ 67.8 105.7
$\ell_{MSE} + \ell_{SSIM}$ 90.1 134.6
$\ell_{SSIM}$ 71.5 115.0
Table 2: Performance of different loss functions.

As we can see, $\ell_{MSE}+\ell_c$ achieves the best MAE. Because of its good performance and simple form, we select it as the loss function.

4 Training Tricks

We collect various tricks in training and evaluate each of them. By combining the effective tricks, we can boost the model’s accuracy without changing the architecture.

Data Augmentation. In crowd counting, estimated density maps are sensitive to the size of people. Resizing and rotation are harmful to the performance of the network, so we apply cropping to enlarge the training set. There are four kinds of cropping: (i) random 0.3 - random 0.9 means randomly cropping a patch at ratio 0.3 - 0.9 from each image. (ii) fixed 0.5 means cropping four non-overlapping quarters from fixed locations in the original images. (iii) fixed + random 0.5 means four patches are cropped from fixed locations and five patches are randomly cropped from each image. (iv) mixed means randomly cropping at a random ratio in {0.3, 0.4, 0.5, 0.6, 0.7}. Table 3 shows that random 0.3 is the most effective data augmentation. In addition, cropping also reduces the training time.
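A minimal sketch of the random crop (e.g. random 0.3) is shown below; the same window is applied to the image and its density map so that the count inside the patch stays consistent. The function name is ours.

```python
import random

def random_crop(img, density, ratio=0.3):
    """img: (C, H, W) tensor; density: (H, W) tensor; returns a ratio-sized patch of both."""
    _, h, w = img.shape
    ch, cw = int(h * ratio), int(w * ratio)
    top = random.randint(0, h - ch)
    left = random.randint(0, w - cw)
    return img[:, top:top + ch, left:left + cw], density[top:top + ch, left:left + cw]
```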

Curriculum Learning. In Wang and Breckon (2019), the curriculum is designed based on the fact that dense crowds are more difficult to count than sparse crowds. During training, the weights of areas with higher density start relatively low and are gradually increased until they equal those of areas with lower density. This is implemented with a weight matrix $W(e)$ defined as:

W(e) = \frac{T(e)}{\max\left\{M_i, T(e)\right\}}  (10)

$T(e)$ is a threshold matrix defined as:

T(e) = k_e e + b_e  (11)

where $k_e$ and $b_e$ are coefficients determined by prior knowledge and $e$ denotes the epoch number. $W$ has the same size as the density map $M_i$. Then $\ell_{MSE}$ is calculated as:

\ell_{MSE} = \frac{1}{N}\sum_{i=1}^{N} \|W(e)\odot(\hat{M_i} - M_i)\|_2  (12)

Empirically, we set $k_e = 2\mathrm{e}{-3}$ and $b_e = 5\mathrm{e}{-3}$. Using curriculum learning decreases the MAE to 63.3, as Table 3 shows.
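The sketch below combines Eqs. (10)-(12); treating $T(e)$ as a constant matrix filled with $k_e e + b_e$ is our reading of the curriculum, and the function name is ours.

```python
import torch

def curriculum_mse(pred, gt, epoch, k_e=2e-3, b_e=5e-3):
    """pred, gt: (B, 1, H, W); returns the weighted MSE of Eq. (12) averaged over the batch."""
    t = k_e * epoch + b_e                    # threshold T(e), broadcast over the map
    w = t / torch.clamp(gt, min=t)           # W(e) = T(e) / max(M_i, T(e)), <= 1 in dense areas
    diff = (w * (pred - gt)).flatten(1)
    return diff.norm(p=2, dim=1).mean()
```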

Value Expansion. The statistics of pixel values in the commonly used datasets are shown in Figure 2. Many pixel values are very small, which can lead to a loss of precision.

Figure 2: Statistics of pixel values on three datasets.

To facilitate the training process, we multiply the ground truth density map by a scale factor $k_{expand}$. At inference time, we multiply the estimated density map by $\frac{1}{k_{expand}}$. Using value expansion, we improve the MAE to 62.6.
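In code, value expansion reduces to a rescaling pair like the sketch below (shown with $k_{expand}=10$, the best setting in Table 3); the helper names are ours.

```python
K_EXPAND = 10

def expand_target(gt_density):
    return gt_density * K_EXPAND           # scale the ground truth before computing the loss

def recover_count(pred_density):
    return pred_density.sum() / K_EXPAND   # undo the scaling when estimating the count
```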

Validate by Patch. Cao et al. (2018) first proposed the patch-based validation strategy. During validation, each image is divided into four quarters, which are fed into the network separately. By summing each quarter's output, we obtain the overall count. Table 3 reports that the RMSE can be improved to 95.4 using this strategy.
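A sketch of this strategy is given below; the quarters are taken by splitting height and width at their midpoints, and the function name is ours.

```python
def count_by_patch(model, img):
    """img: (1, C, H, W) tensor; returns the estimated count as the sum over four quarters."""
    _, _, h, w = img.shape
    total = 0.0
    for t0, t1 in ((0, h // 2), (h // 2, h)):
        for l0, l1 in ((0, w // 2), (w // 2, w)):
            total += model(img[:, :, t0:t1, l0:l1]).sum().item()
    return total
```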

Training Tricks MAE RMSE
Baseline 67.8 105.7
random 0.3 64.9 100.5
random 0.4 66.3 99.4
random 0.5 66.0 97.8
random 0.6 66.0 98.9
random 0.7 66.5 98.1
random 0.8 67.9 100.7
random 0.9 68.0 100.9
fixed 0.5 68.7 104.0
fixed + random 0.5 65.6 100.8
mixed 69.0 103.6
+ curriculum learning 63.3 97.3
+ value expansion 10 62.6 97.8
+ value expansion 100 66.4 101.6
+ value expansion 1000 67.7 105.1
+ validate by patch 63.1 95.4
Table 3: Performance of adding training tricks.

5 Learning Objectives

Besides density maps, Liu et al. (2019b) and Wang and Breckon (2019) propose using attention maps to emphasize crowd regions and weaken the impact of background regions. An extra branch with the configuration $2\times CS(64,3,2) - C(1,1,1) - Sigmoid$ is inserted after the penultimate layer and produces a one-channel attention map $M_i^{att}$ of the same size as the density map $M_i$. This branch aims to regress an attention map whose values are close to 1 in foreground regions and close to 0 in background regions, as illustrated in Figure 3.

Figure 3: Network with an attention branch inserted.

Ground Truth Attention Map Generation. There are two ways of generating attention maps. (i) Window-based: we set a $25\times 25$ window centered at each labeled point; values within these windows are set to 1 and those outside are set to 0. (ii) Threshold-based: we set the threshold at the value that is larger than 40% of all pixel values. We test both ways, and way (i) gives a more accurate result.
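A minimal sketch of the window-based generation is shown below; boundary clipping is our assumption, and the function name is ours.

```python
import numpy as np

def window_attention_map(points, height, width, win=25):
    """Set a win x win window of ones around every annotated head; zeros elsewhere."""
    att = np.zeros((height, width), dtype=np.float32)
    r = win // 2
    for x, y in points:
        x, y = int(x), int(y)
        att[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1] = 1.0
    return att
```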

Attention Map Loss. The loss function of the attention map is defined as

\ell^{att}(\Theta) = \left\|M_i^{att}\odot\log\left(\hat{M}_i^{att}\right) + \left(1 - M_i^{att}\right)\odot\log\left(1 - \hat{M}_i^{att}\right)\right\|_1  (13)

The total loss is a weighted sum: $\mathcal{L}_{total} = \ell_{MSE} + \ell_c + \lambda\ell^{att}$. Empirically, the model achieves the best performance when $\lambda = 0.5$.
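A sketch of Eq. (13) is given below; the clamping constant guards against $\log(0)$ and is our numerical addition, and the function name is ours.

```python
import torch

def attention_loss(pred_att, gt_att, eps=1e-7):
    """Eq. (13): L1 norm of the element-wise binary cross-entropy map."""
    pred_att = pred_att.clamp(eps, 1 - eps)   # avoid log(0); eps is our safeguard
    bce = gt_att * torch.log(pred_att) + (1 - gt_att) * torch.log(1 - pred_att)
    return bce.abs().sum()

# Total objective (lambda = 0.5 gave the best result in Table 4):
# loss = l_mse + l_c + 0.5 * attention_loss(pred_att, gt_att)
```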

Figure 4: Visualization of estimated density maps and attention maps on the ShanghaiTech dataset. (a) Image. (b) Baseline. (c) Our strong baseline. (d) Attention maps. (e) Ground truth.

Density Map Size. We change the number of upsample layers to evaluate the impact of different density map sizes. By adding 1 - 4 upsample layers, we obtain density maps of $\frac{1}{8}\times\frac{1}{8}$ to $1\times 1$ of the input size. Table 4 shows that estimating density maps of the same size as the input benefits the performance.

Learning Objectives MAE RMSE
Baseline 63.1 95.4
+AM (Window-based) $\lambda=0.02$ 66.8 99.3
+AM (Threshold-based) $\lambda=0.02$ 65.8 100.6
+AM $\lambda=0.02$ 61.8 96.9
+AM $\lambda=0.1$ 65.8 100.6
+AM $\lambda=0.5$ 60.1 95.5
$\frac{1}{8}\times\frac{1}{8}$ size 67.2 101.8
$\frac{1}{4}\times\frac{1}{4}$ size 64.7 101.4
$\frac{1}{2}\times\frac{1}{2}$ size 63.8 95.0
$1\times 1$ size 60.1 95.5
Table 4: Performance of different learning objectives. AM means attention map.

6 Comparisons with State-of-the-art

We compare our final model against other state-of-the-art methods on three commonly used datasets.

The ShanghaiTech dataset is introduced by Zhang et al. (2016) and consists of PartA and PartB. PartA has 300 training images and 182 testing images with relatively high density. PartB has 400 training images and 316 testing images with relatively low density. As Table 5 shows, our strong baseline decreases MAE by 2.6% on PartA and RMSE by 0.8% on PartB.

PartA PartB
Models MAE RMSE MAE RMSE
Li et al. (2018) 68.2 115.0 10.6 16.0
Cao et al. (2018) 67.0 104.5 8.4 13.6
Jiang et al. (2019) 64.2 109.1 8.2 12.8
Liu et al. (2019b) 63.2 98.9 7.7 12.9
Liu et al. (2019c) 62.3 100.0 7.8 12.2
Liu et al. (2019a) 65.1 106.7 8.4 14.1
Shi et al. (2019) 62.4 102.0 7.6 11.8
Wang et al. (2019) 64.8 107.5 7.6 13.0
Chen et al. (2019) 61.7 99.5 9.4 14.4
Wan et al. (2019) 63.1 96.2 8.7 13.6
Ours 60.1 95.5 7.9 11.7
Table 5: Comparisons on ShanghaiTech dataset.

The UCF-QNRF dataset is introduced by Idrees et al. (2018). It has 1201 training images and 334 testing images of high resolution. Performance on this dataset is reported in Table 6; our strong baseline decreases MAE by 2.7%.

The UCF_CC_50 dataset is introduced by Idrees et al. (2013) and includes 50 crowd images with extremely high density. We follow the standard 5-fold cross-validation. As Table 6 shows, our baseline decreases MAE by 4.0%.

UCF-QNRF UCF_CC_50
Models MAE RMSE MAE RMSE
Li et al. (2018) - - 266.1 397.5
Cao et al. (2018) - - 258.4 334.9
Jiang et al. (2019) 113.0 188.0 249.4 354.5
Liu et al. (2019b) - - 257.1 363.5
Liu et al. (2019c) 107.0 183.0 212.2 243.7
Liu et al. (2019a) 116.0 195.0 - -
Shi et al. (2019) - - 241.7 320.7
Wang et al. (2019) 102.0 171.4 214.2 318.2
Chen et al. (2019) - - 259.2 335.9
Wan et al. (2019) - - 355.0 560.2
Ours 99.2 179.1 203.7 283.1
Table 6: Comparisons on UCF-QNRF and UCF_CC_50 dataset.

7 Unsupervised People Localization

7.1 Point Set Construction

Figure 5: Point set construction.

First, we use the density maps to construct a point set. Each density map is multiplied by an expansion factor $k$ and rounded, so each pixel has an integer indicating its frequency. In our experiments, $k$ is set to 500. We sample the coordinate $(i,j)$ of each pixel according to its frequency, as Figure 5 shows, and construct a point set.
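The point set construction can be written compactly as in the sketch below; the function name is ours.

```python
import numpy as np

def build_point_set(density, k=500):
    """Repeat each pixel coordinate (i, j) according to its rounded, expanded density value."""
    freq = np.rint(density * k).astype(int)               # integer frequency per pixel
    ys, xs = np.nonzero(freq)
    return np.repeat(np.stack([ys, xs], axis=1), freq[ys, xs], axis=0)
```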

7.2 KMeans

The estimated people count $n$ is calculated by summing all pixel values of a given density map, which means there are about $n$ people in it. Consequently, it is natural to use $K=n$ as the number of clusters. We use the KMeans clustering algorithm as a baseline to locate the heads based on the sampled point set. Typical clustering results are shown in the second column of Figure 6.
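A sketch of this KMeans baseline with scikit-learn is shown below; it reuses the build_point_set helper sketched in Section 7.1, and the function name is ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def locate_heads_kmeans(density, k=500):
    """Return an (n, 2) array of head coordinates, where n is the rounded estimated count."""
    n = int(round(density.sum()))
    points = build_point_set(density, k)        # helper sketched in Section 7.1
    if n < 1 or len(points) == 0:
        return np.empty((0, 2))
    n = min(n, len(points))                     # KMeans needs at least n samples
    return KMeans(n_clusters=n).fit(points).cluster_centers_
```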

Datasets $\delta$ KMeans Isolated KMeans
ShanghaiTech PartA 40 83.8% 88.3%
20 75.1% 81.6%
10 36.9% 44.2%
ShanghaiTech PartB 40 90.7% 94.3%
20 81.3% 87.0%
10 54.5% 63.9%
UCF-QNRF 40 75.1% 78.0%
20 50.7% 54.6%
10 23.9% 26.1%
UCF_CC_50 40 70.3% 73.9%
20 63.4% 66.8%
10 30.7% 35.9%
Table 7: Localization performance of KMeans and isolated KMeans on the crowd benchmarks in terms of AP.

Localization precision is calculated in a similar way to the evaluation metric used in MS-COCO Lin et al. (2014): 1) all centers are ranked according to the number of points in the corresponding cluster; 2) from the center with the most points to the one with the fewest, each center is matched to the nearest person in the ground truth; 3) the overlap of the two windows of size $\delta$ centered at the matched pair of points is calculated. Each labelled person in the ground truth can be matched only once. The Average Precision (AP) of the KMeans-based localization algorithm is reported in Table 7.

Figure 6: Visualization of the unsupervised localization algorithm. (a) Density map. (b) Head centers by KMeans. (c) Subregions and centers by isolated KMeans. (d) Head centers by isolated KMeans. In the second and last columns, red circles are the ground truth positions and green circles are the estimated head positions. In the third column, different colors denote different subregions and '+' marks the centers found in each subregion.

7.3 Isolated KMeans

Algorithm 1 Isolated KMeans algorithm for locating heads.
1: Calculate the global cluster number $K=n$ by summing all the pixel values in the density map $M$.
2: Construct the point set $S$ by multiplying the density map $M$ by an expansion factor $k$ and rounding.
3: Use DBSCAN to divide $S$ into $N_{sub}$ clusters.
4: for sub-point-set $S_i$ in cluster 1, ..., cluster $N_{sub}$ do
5:   Calculate the cluster number $K_i$ of this subregion: $K_i = \sum_{j \in S_i} M_j$
6: end for
7: Sort the subregions by $K_i$ in ascending order to obtain cluster 1', ..., cluster $N_{sub}'$.
8: for sub-point-set $S_i'$ in cluster 1', ..., cluster $N_{sub}'-1$ do
9:   Use KMeans to find $K_i'$ centers in this subregion.
10: end for
11: Use KMeans to find $K - \sum_{i=1'}^{N_{sub}'-1} K_i'$ centers in the last subregion.
12: return Positions of $K$ heads.

Although the people count $n$ matches the number of centers $K$ for the whole map in KMeans, in some regions the number of clusters does not match the sum of the pixel values there. To make the cluster centers more consistent with the crowd distribution, we first divide the whole density map into several subregions and then use KMeans to locate the heads in each subregion separately. In this way, the people count matches the cluster number both globally and locally.

To be specific, we use DBSCAN with $\epsilon = 5$ to divide the point set into several clusters. The points in each cluster are regarded as belonging to one subregion. In each subregion, the people count, i.e., the cluster number, is calculated by summing the pixel values in that region, and KMeans is then used to locate the heads. This is illustrated in Algorithm 1. The AP of isolated KMeans is calculated in the same way as before and reported in Table 7. The localization accuracy is improved for all $\delta$ values. Typical results of the divided subregions, their centers, and the localization results of isolated KMeans are illustrated in Figure 6.
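A compact sketch of Algorithm 1 with scikit-learn follows; it reuses the build_point_set helper from Section 7.1. Computing local counts over the unique pixels of each subregion, ignoring DBSCAN noise points, and the guards (e.g. at least one center per subregion) are our assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def isolated_kmeans(density, k=500, eps=5):
    """Locate K = round(sum(density)) head centers under local count constraints (Algorithm 1)."""
    K = int(round(density.sum()))                          # global cluster number
    points = build_point_set(density, k)                   # Section 7.1, duplicates included
    labels = DBSCAN(eps=eps).fit_predict(points)
    regions = [r for r in set(labels) if r != -1]          # -1 marks DBSCAN noise
    counts = {}
    for r in regions:                                      # local count: sum of density over the subregion
        uniq = np.unique(points[labels == r], axis=0)
        counts[r] = density[uniq[:, 0], uniq[:, 1]].sum()
    order = sorted(regions, key=lambda r: counts[r])       # ascending by local count
    centers, assigned = [], 0
    for idx, r in enumerate(order):
        pts = points[labels == r]
        if idx < len(order) - 1:
            k_i = max(1, int(round(counts[r])))            # local cluster number K_i
        else:
            k_i = max(1, K - assigned)                     # remainder goes to the last subregion
        k_i = min(k_i, len(pts))
        centers.append(KMeans(n_clusters=k_i).fit(pts).cluster_centers_)
        assigned += k_i
    return np.vstack(centers) if centers else np.empty((0, 2))
```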

8 Conclusion

In this paper, we collect and evaluate a series of backbones and training tricks for training a neural network for crowd counting. By selecting the best backbone and applying the effective training tricks together, we construct an efficient and accurate baseline which improves MAE and RMSE significantly on three commonly used datasets. We also propose an unsupervised people localization method named isolated KMeans. This clustering algorithm uses a point set constructed from the density maps, which eliminates the time needed to train detection networks, and it can be integrated with any existing counting method.

References

  • Cao et al. [2018] Xinkun Cao, Zhipeng Wang, Yanyun Zhao, and Fei Su. Scale aggregation network for accurate and efficient crowd counting. In Proceedings of the European Conference on Computer Vision (ECCV), pages 734–750, 2018.
  • Chen et al. [2019] Xinya Chen, Yanrui Bin, Nong Sang, and Changxin Gao. Scale pyramid network for crowd counting. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1941–1950. IEEE, 2019.
  • Dai et al. [2019] Feng Dai, Hao Liu, Yike Ma, Juan Cao, Qiang Zhao, and Yongdong Zhang. Dense scale network for crowd counting. arXiv preprint arXiv:1906.09707, 2019.
  • He et al. [2019] Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 558–567, 2019.
  • Hu et al. [2018] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
  • Idrees et al. [2013] Haroon Idrees, Imran Saleemi, Cody Seibert, and Mubarak Shah. Multi-source multi-scale counting in extremely dense crowd images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2547–2554, 2013.
  • Idrees et al. [2018] Haroon Idrees, Muhmmad Tayyab, Kishan Athrey, Dong Zhang, Somaya Al-Maadeed, Nasir Rajpoot, and Mubarak Shah. Composition loss for counting, density map estimation and localization in dense crowds. In Proceedings of the European Conference on Computer Vision (ECCV), pages 532–546, 2018.
  • Jiang et al. [2019] Xiaolong Jiang, Zehao Xiao, Baochang Zhang, Xiantong Zhen, Xianbin Cao, David Doermann, and Ling Shao. Crowd counting and density estimation by trellis encoder-decoder networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6133–6142, 2019.
  • Li et al. [2018] Yuhong Li, Xiaofan Zhang, and Deming Chen. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1091–1100, 2018.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • Liu et al. [2019a] Chenchen Liu, Xinyu Weng, and Yadong Mu. Recurrent attentive zooming for joint crowd counting and precise localization. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1217–1226. IEEE, 2019.
  • Liu et al. [2019b] Ning Liu, Yongchao Long, Changqing Zou, Qun Niu, Li Pan, and Hefeng Wu. Adcrowdnet: An attention-injective deformable convolutional network for crowd understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3225–3234, 2019.
  • Liu et al. [2019c] Weizhe Liu, Mathieu Salzmann, and Pascal Fua. Context-aware crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5099–5108, 2019.
  • Liu et al. [2019d] Yuting Liu, Miaojing Shi, Qijun Zhao, and Xiaofang Wang. Point in, box out: Beyond counting persons in crowds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6469–6478, 2019.
  • Ramachandran et al. [2017] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Swish: a self-gated activation function. arXiv preprint arXiv:1710.05941, 7, 2017.
  • Shi et al. [2019] Miaojing Shi, Zhaohui Yang, Chao Xu, and Qijun Chen. Revisiting perspective information for efficient crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7279–7288, 2019.
  • Wan et al. [2019] Jia Wan, Wenhan Luo, Baoyuan Wu, Antoni B Chan, and Wei Liu. Residual regression with semantic prior for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4036–4045, 2019.
  • Wang and Breckon [2019] Qian Wang and Toby P Breckon. Segmentation guided attention network for crowd counting via curriculum learning. arXiv preprint arXiv:1911.07990, 2019.
  • Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • Wang et al. [2019] Qi Wang, Junyu Gao, Wei Lin, and Yuan Yuan. Learning from synthetic data for crowd counting in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8198–8207, 2019.
  • Zhang et al. [2016] Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 589–597, 2016.