A Strong Baseline for Crowd Counting and Unsupervised People Localization
Abstract
In this paper, we explore a strong baseline for crowd counting and an unsupervised people localization algorithm based on estimated density maps. Existing methods achieve state-of-the-art performance with different backbones and a variety of training tricks. We collect and evaluate these backbones and tricks, measure the impact of changing each of them, and develop an efficient pipeline for crowd counting that decreases MAE and RMSE significantly on multiple datasets. We also propose a clustering algorithm named isolated KMeans to locate the heads in density maps. It divides a density map into subregions and finds cluster centers under local count constraints, requires no trainable parameters, and can be integrated easily with existing methods.
1 Introduction
Crowd counting is a computer vision task that aims to output a density map indicating the distribution of a crowd in a still image; the estimated people count is obtained by integrating over the map. In recent years, many novel networks have been designed, and the MAE on the ShanghaiTech PartA dataset has improved from 110.2 (MCNN) to 61.7 (SPN). It is important to note, however, that this improvement is not only the result of newly proposed architectures; it also benefits from various training tricks used in implementations. According to He et al. (2019), training details, including data preprocessing, changes to the loss function, ground truth generation, and the setting of learning objectives, make a big difference.
Since Li et al. (2018), most researchers have extended this line of work using VGG-16 as the backbone, with fully convolutional networks whose main components are the front-end and back-end shown in Figure 1(a). Wang and Breckon (2019) have shown that training tricks can make a big difference. There are also some methods using other classification networks as the backbone. To make a fair comparison and construct a strong baseline that is easy to follow for both academia and industry, we collect and evaluate a set of backbones and tricks for regressing a more accurate density map. As Figure 1(b) shows, by combining these tricks, MAE can be decreased from 74.9 to 60.1 and RMSE from 126.1 to 95.5, which is better than SPN. The tricks show consistent improvement on multiple datasets, which demonstrates the effectiveness of this strong baseline.
Another drawback of existing methods is the lack of position information for each person. Since position information is essential for understanding crowd images, we extend our work to find the position of each person. Liu et al. (2019a) and Liu et al. (2019d) combine detection networks to locate heads. However, crowd counting datasets have no standard bounding-box labels as detection datasets do, so such ground truth is not reliable. Besides, the parameters of the detection networks have to be retrained for different datasets, which is time-consuming in practical use. In fact, estimated density maps already indicate positions by showing clusters around each head center, and a human can find most of the heads easily by observing these clusters. Based on this observation, we propose an unsupervised localization method named isolated KMeans based on clustering. Because the people count can be obtained by integrating the density map, the problem simplifies to finding centers that indicate the positions of people. Each point with a value greater than 0 in a density map is regarded as a potential head, and cluster centers converge to points near the heads. In addition, isolated KMeans locates head centers under local count constraints so that the number of cluster centers matches the number of people both globally and locally. This localization method can be integrated with any existing crowd counting method without training numerous parameters. To the best of our knowledge, we are the first to use an unsupervised method for people localization in this field.
To summarize, our main contributions are threefold:
- We collect a series of backbones and training tricks for crowd counting and evaluate their impact on model performance, to validate their effectiveness and make a fair comparison.
- We develop a strong baseline by combining the effective tricks, which improves counting accuracy and density map quality significantly on three widely used datasets.
- We propose an unsupervised method for people localization in density maps. It relies on a clustering algorithm named isolated KMeans, which divides a density map into subregions and locates heads under local people count constraints. This frees us from training the large number of parameters of a detection network and can be integrated with any existing counting method.
2 The Setup of the Standard Baseline
All datasets provide only discrete point labels at the centers of heads, which are hard to learn from directly. We therefore first convert the labeled points into density maps with a continuous distribution.
Ground Truth Density Map Generation. Following most previous works, each labeled point at a head’s center is substituted with a Gaussian distribution, and superimposing multiple Gaussian distributions produces the ground truth density map. This is formulated as
$$D(\mathbf{x}) = \sum_{i=1}^{N} \delta(\mathbf{x} - \mathbf{x}_i) * G_{\sigma_i}(\mathbf{x}) \qquad (1)$$

where $D$ denotes the density map, $\mathbf{x}_i$ is the coordinate of the $i$-th head annotation, $G_{\sigma_i}$ is a Gaussian distribution with standard deviation $\sigma_i$, and $N$ is the number of head annotations. In the ShanghaiTech PartB dataset, the standard deviation parameter $\sigma$ is set to 15. In other datasets, $\sigma_i$ is calculated as $\sigma_i = \beta \bar{d}_i$, where $\bar{d}_i$ is the average distance between $\mathbf{x}_i$ and its $k$ nearest neighbors. In this paper we adopt the widely used configuration of $\beta = 0.3$ and $k = 3$ from Zhang et al. (2016).
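For concreteness, a minimal NumPy/SciPy sketch of this generation step is given below; the per-point filtering loop favors clarity over speed, and the fallback $\sigma = 15$ for an image with a single annotation is our assumption:

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial import cKDTree

def generate_density_map(points, height, width, beta=0.3, k=3):
    """Geometry-adaptive ground truth: one Gaussian per annotated head,
    with sigma_i = beta * (average distance to the k nearest neighbors)."""
    density = np.zeros((height, width), dtype=np.float32)
    if len(points) == 0:
        return density
    if len(points) > 1:
        # query k+1 neighbors: the nearest neighbor of a point is itself
        dists, _ = cKDTree(points).query(points, k=min(k + 1, len(points)))
    for i, (x, y) in enumerate(points):
        sigma = beta * dists[i][1:].mean() if len(points) > 1 else 15.0
        impulse = np.zeros_like(density)
        impulse[min(int(y), height - 1), min(int(x), width - 1)] = 1.0
        density += gaussian_filter(impulse, sigma)  # each head contributes ~1 to the integral
    return density
```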
Training Settings and Evaluation Metrics. Experiments for the comparisons below are conducted on the ShanghaiTech PartA dataset. We use PyTorch to implement the whole training process. For all models, the total number of training epochs is set to 200 and the initial learning rate to 1e-5. The Adam optimizer and a ReduceLROnPlateau learning rate scheduler are used to optimize the parameters. We adopt the two most widely used metrics, MAE and RMSE, to evaluate counting accuracy and density map quality.
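A minimal PyTorch sketch of this configuration follows; the data loaders are hypothetical placeholders (training batches yield ground truth maps, validation batches yield ground truth counts):

```python
import torch

def fit(model, train_loader, val_loader, epochs=200, device='cuda'):
    """Training configuration used throughout: Adam at 1e-5 with a
    ReduceLROnPlateau schedule driven by validation MAE."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
    mse = torch.nn.MSELoss()
    for epoch in range(epochs):
        model.train()
        for image, gt_density in train_loader:
            optimizer.zero_grad()
            loss = mse(model(image.to(device)), gt_density.to(device))
            loss.backward()
            optimizer.step()
        mae, rmse = evaluate(model, val_loader, device)
        scheduler.step(mae)  # reduce the learning rate when validation MAE plateaus

def evaluate(model, loader, device='cuda'):
    """MAE and RMSE of the integral counts over a validation set."""
    model.eval()
    abs_err, sq_err, n = 0.0, 0.0, 0
    with torch.no_grad():
        for image, gt_count in loader:
            est = model(image.to(device)).sum().item()  # count = integral of the map
            abs_err += abs(est - gt_count.item())
            sq_err += (est - gt_count.item()) ** 2
            n += 1
    return abs_err / n, (sq_err / n) ** 0.5
```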
Backbones. Most existing methods adopt VGG-16 as the backbone; some literature also adopts ResNet-50 (Wang et al., 2019) and InceptionV3 (Wang and Breckon, 2019). Similarly to Li et al. (2018), we remove the fully connected layers and preserve the convolutional layers to compare the performance of these backbones, and we explore whether batch normalization should be used. For the back-end, we use a stack of dilated convolutional blocks: each group of $N$ convolutional layers with $M$ filters of size $k \times k$ and dilation rate $d$ is integrated with an SE block (Hu et al., 2018) and followed by the Swish activation (Ramachandran et al., 2017). We use bilinear interpolation to upsample the feature maps.
Table 1: Performance of different backbones on ShanghaiTech PartA.

| Backbones | MAE | RMSE |
|---|---|---|
| first 10 layers of VGG-16 | 74.8 | 126.1 |
| first 13 layers of VGG-16 | 74.9 | 118.1 |
| first 13 layers of VGG-16-bn | 92.1 | 134.6 |
| InceptionV3 | 119.4 | 170.5 |
| ResNet-50 | 80.6 | 130.1 |
For models without batch normalization, we train on original images of different sizes. For models with batch normalization, we train on patches randomly cropped from the images: we randomly choose 4 images at a time and randomly crop 4 patches from each of them, so one batch consists of 16 patches. In our experiments, if the back-end contains batch normalization layers, the model degenerates to outputting all zeros, so batch normalization layers are used only in the front-end. The performance of different backbones is reported in Table 1. Due to its superior performance, we select the first 13 layers of VGG-16 as the backbone for the following experiments; a PyTorch sketch of the resulting architecture is given below.
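In the sketch, the back-end channel widths and the dilation rate of 2 are illustrative assumptions rather than the exact configuration:

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention (Hu et al., 2018)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: global average pool -> (B, C)
        return x * w[:, :, None, None]    # excite: channel-wise rescaling

class Baseline(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(pretrained=True)
        self.frontend = vgg.features[:30]  # first 13 conv layers, output stride 16
        backend = []
        for ch_in, ch_out in [(512, 512), (512, 256), (256, 128), (128, 64)]:
            backend += [nn.Conv2d(ch_in, ch_out, 3, padding=2, dilation=2),
                        SEBlock(ch_out),
                        nn.SiLU()]         # Swish(x) = x * sigmoid(x), i.e. SiLU
        self.backend = nn.Sequential(*backend)
        self.head = nn.Conv2d(64, 1, 1)
    def forward(self, x):
        h, w = x.shape[2:]
        y = self.head(self.backend(self.frontend(x)))
        # bilinear upsampling back to input resolution (full-size density map)
        return F.interpolate(y, size=(h, w), mode='bilinear', align_corners=False)
```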
3 Loss Functions
Many loss functions have been proposed to improve the quality of density maps. Among them, MSE loss ($L_{MSE}$) is the most commonly used for its simplicity and effectiveness. It is defined as:

$$L_{MSE} = \frac{1}{M} \left\| D^{est} - D^{gt} \right\|_2^2 \qquad (2)$$

where $D^{est}$ and $D^{gt}$ are the estimated and ground truth density maps and $M$ is the number of pixels.
Spatial Abstraction Loss ($L_{SA}$) is proposed by Jiang et al. (2019); it progressively computes MSE losses on multiple abstraction levels. The computation is formalized as:

$$L_{SA} = \sum_{k=1}^{K} \frac{1}{M_k} \left\| D^{est}_k - D^{gt}_k \right\|_2^2 \qquad (3)$$

where $k$ indicates the abstraction level, with $K$ set to 3; $D^{est}_k$ and $D^{gt}_k$ are the density maps downsampled by pooling layers, and $M_k$ is the number of pixels within a map.
Multi-scale density level consistency loss ($L_C$) is proposed by Dai et al. (2019). This loss separates the density maps into subregions using adaptive average pooling and computes an L1 loss to enforce consistency at different scale levels. It is defined as:

$$L_{C} = \sum_{s=1}^{S} \frac{1}{k_s} \left\| P_{ave}(D^{est}, k_s) - P_{ave}(D^{gt}, k_s) \right\|_1 \qquad (4)$$

where $S$ is the number of scale levels, set to 3, $k_s$ is the output size of the pooled density map at level $s$, and $P_{ave}(\cdot, k)$ denotes adaptive average pooling to a $k \times k$ output.
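A sketch of this loss using PyTorch's adaptive average pooling; the pooled output sizes `(1, 2, 4)` are our assumption for the per-level sizes $k_s$:

```python
import torch.nn.functional as F

def consistency_loss(est, gt, sizes=(1, 2, 4)):
    """Multi-scale density level consistency: L1 distance between the
    estimated and ground-truth maps after adaptive average pooling."""
    return sum(F.l1_loss(F.adaptive_avg_pool2d(est, s),
                         F.adaptive_avg_pool2d(gt, s)) for s in sizes)
```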
SSIM was proposed by Wang et al. (2004) to measure the similarity of images, and the Local Pattern Consistency Loss ($L_{LC}$) is introduced by Cao et al. (2018) to enhance structural consistency. A fixed-parameter Gaussian kernel is adopted to define the weights $w$ of different positions in one sliding window. For the same location in the estimated and ground truth density maps $x$ and $y$, the local statistics are defined as:
$$\mu_x = \sum_{i} w_i x_i, \qquad \mu_y = \sum_{i} w_i y_i \qquad (5)$$

$$\sigma_x^2 = \sum_{i} w_i (x_i - \mu_x)^2, \qquad \sigma_y^2 = \sum_{i} w_i (y_i - \mu_y)^2 \qquad (6)$$

$$\sigma_{xy} = \sum_{i} w_i (x_i - \mu_x)(y_i - \mu_y) \qquad (7)$$
and the SSIM index and SSIM loss are calculated as:
$$SSIM = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \qquad (8)$$

$$L_{LC} = 1 - \frac{1}{M} \sum SSIM \qquad (9)$$

where $C_1$ and $C_2$ are small constants for numerical stability and $M$ is the number of sliding-window locations.
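A compact sketch of $L_{LC}$ for `(B, 1, H, W)` tensors; the 11×11, σ = 1.5 Gaussian window and the constants $C_1$, $C_2$ follow common SSIM defaults and are assumptions here:

```python
import torch
import torch.nn.functional as F

def gaussian_window(size=11, sigma=1.5):
    """Fixed-parameter kernel defining the sliding-window weights."""
    coords = torch.arange(size, dtype=torch.float32) - size // 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    w = torch.outer(g, g)
    return (w / w.sum()).view(1, 1, size, size)

def local_pattern_loss(est, gt, c1=1e-4, c2=9e-4):
    """L_LC = 1 - mean local SSIM between est and gt (Eqs. 5-9)."""
    w = gaussian_window().to(est.device)
    pad = w.shape[-1] // 2
    mu_x, mu_y = F.conv2d(est, w, padding=pad), F.conv2d(gt, w, padding=pad)
    var_x = F.conv2d(est * est, w, padding=pad) - mu_x ** 2
    var_y = F.conv2d(gt * gt, w, padding=pad) - mu_y ** 2
    cov = F.conv2d(est * gt, w, padding=pad) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1 - ssim.mean()
```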
We train the model using different loss functions, and the results are reported in Table 2.
Table 2: Comparison of loss functions on ShanghaiTech PartA.

| Loss Functions | MAE | RMSE |
|---|---|---|
| $L_{MSE}$ | 74.9 | 118.1 |
| … + … | 149.0 | 225.4 |
| … + … | 128.4 | 200.9 |
| … | 67.8 | 105.7 |
| … | 90.1 | 134.6 |
| … | 71.5 | 115.0 |
As we can see, the loss in the fourth row of Table 2 achieves the best MAE (67.8); because of its simple form, we select it as our loss function.
4 Training Tricks
We collect various training tricks and evaluate each of them. By combining the effective ones, we boost the model's accuracy without changing the architecture.
Data Augmentation. In crowd counting, estimated density maps are sensitive to the size of people, and resizing or rotation harms the performance of the network, so we apply only cropping to enlarge the training set. There are four kinds of cropping: (i) random 0.3 – random 0.9 means randomly cropping by a ratio of 0.3 – 0.9 from each image; (ii) fixed 0.5 means cropping 4 non-overlapping quarters from fixed locations in the original image; (iii) fixed + random 0.5 means 4 patches cropped from fixed locations plus 5 patches randomly cropped from each image; (iv) mixed means randomly cropping by a random ratio in {0.3, 0.4, 0.5, 0.6, 0.7}. Table 3 shows that random 0.3 is the most effective data augmentation; a cropping sketch is given below. In addition, cropping also reduces training time.
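A minimal sketch of the joint image/density crop, assuming a channel-first image array and a 2-D density map; cropping is count-preserving within the patch, which is why it is safe where resizing is not:

```python
import random

def random_crop(image, density, ratio=0.3):
    """Crop an aligned (image, density) patch whose sides are `ratio`
    times the original; the patch's count is its density integral."""
    h, w = density.shape
    ch, cw = int(h * ratio), int(w * ratio)
    top = random.randint(0, h - ch)
    left = random.randint(0, w - cw)
    return (image[..., top:top + ch, left:left + cw],
            density[top:top + ch, left:left + cw])
```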
Curriculum Learning. In Wang and Breckon (2019), the curriculum is designed based on the fact that dense crowds are more difficult to count than sparse crowds. During training, the weights of areas with higher density start relatively low and are gradually increased until they equal those of areas with lower density. This is implemented by a weight matrix $W$, defined element-wise as:

$$W_{ij} = \begin{cases} 1, & D^{gt}_{ij} \le T_{ij} \\ T_{ij} / D^{gt}_{ij}, & D^{gt}_{ij} > T_{ij} \end{cases} \qquad (10)$$

$T$ is a threshold matrix, defined as:

$$T_{ij} = a \cdot e + b \qquad (11)$$

where $a$ and $b$ are coefficients determined by prior knowledge and $e$ denotes the epoch number. $T$ has the same size as the density map $D^{gt}$. The weighted loss is then calculated as:

$$L_{CL} = \frac{1}{M} \sum_{i,j} W_{ij} \left( D^{est}_{ij} - D^{gt}_{ij} \right)^2 \qquad (12)$$

The coefficients $a$ and $b$ are set empirically. Using curriculum learning decreases MAE to 63.3, as Table 3 shows.
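A sketch of Eqs. (10)–(12) follows; $a$ and $b$ are left as arguments since their values come from prior knowledge:

```python
import torch

def curriculum_mse(est, gt, epoch, a, b):
    """Weighted MSE (Eq. 12): dense pixels (gt > T) are down-weighted by
    T / gt, and the threshold T = a * epoch + b grows with training."""
    T = torch.full_like(gt, a * epoch + b)         # threshold matrix (Eq. 11)
    W = torch.where(gt <= T, torch.ones_like(gt),  # weight matrix (Eq. 10)
                    T / gt.clamp(min=1e-8))
    return (W * (est - gt) ** 2).mean()
```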
Value Expansion. The statistics of pixel values in the mainly used datasets can be viewed in Figure 2. Many pixel values are very small, which can lead to a loss of numerical precision.
To facilitate the training process, we multiply the ground truth density map by a scale factor; at inference time, the estimated density map is divided by the same factor. Using value expansion with a factor of 10, we improve MAE to 62.6.
Validate by Patch. Cao et al. (2018) first proposed the patch-based validation strategy: each image is divided into several quarters, the quarters are fed into the network, and the overall count is obtained by summing each quarter's output. Table 3 reports that RMSE improves to 95.4 using this strategy; a sketch is given after Table 3.
Table 3: Effect of training tricks on ShanghaiTech PartA.

| Training Tricks | MAE | RMSE |
|---|---|---|
| Baseline | 67.8 | 105.7 |
| random 0.3 | 64.9 | 100.5 |
| random 0.4 | 66.3 | 99.4 |
| random 0.5 | 66.0 | 97.8 |
| random 0.6 | 66.0 | 98.9 |
| random 0.7 | 66.5 | 98.1 |
| random 0.8 | 67.9 | 100.7 |
| random 0.9 | 68.0 | 100.9 |
| fixed 0.5 | 68.7 | 104.0 |
| fixed + random 0.5 | 65.6 | 100.8 |
| mixed | 69.0 | 103.6 |
| + curriculum learning | 63.3 | 97.3 |
| + value expansion 10 | 62.6 | 97.8 |
| + value expansion 100 | 66.4 | 101.6 |
| + value expansion 1000 | 67.7 | 105.1 |
| + validate by patch | 63.1 | 95.4 |
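A minimal sketch of the patch-based validation strategy, assuming the image is a single `(1, C, H, W)` tensor:

```python
import torch

def count_by_patch(model, image):
    """Split an image into four quarters, run each through the network,
    and sum the per-quarter counts to get the overall estimate."""
    _, _, h, w = image.shape
    total = 0.0
    with torch.no_grad():
        for rows in (slice(0, h // 2), slice(h // 2, h)):
            for cols in (slice(0, w // 2), slice(w // 2, w)):
                total += model(image[:, :, rows, cols]).sum().item()
    return total
```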
5 Learning Objectives
Besides density maps, Liu et al. (2019b) and Wang and Breckon (2019) propose using attention maps to emphasize crowd regions and weaken the impact of background regions. An extra branch is inserted after the penultimate layer and produces a one-channel attention map $A$ with the same size as the density map $D$. This branch aims to regress an attention map whose values are close to 1 in foreground regions and close to 0 in background regions, as illustrated in Figure 3.
Ground Truth Attention Map Generation. There are two ways of generating attention maps. (i) Window-based: we set a window centered at each labeled point; values within these windows are set to 1 and those outside are set to 0. (ii) Threshold-based: a pixel is set to 1 if its density value is larger than 40% of the values in the map, and 0 otherwise. We test both ways, and way (i) gives more accurate results.
Attention Map Loss. The loss function of the attention map is a pixel-wise binary cross-entropy, defined as

$$L_{AM} = -\frac{1}{M} \sum_{i=1}^{M} \left[ A^{gt}_i \log A^{est}_i + (1 - A^{gt}_i) \log(1 - A^{est}_i) \right] \qquad (13)$$

The total loss is a weighted sum $L = L_{MSE} + \alpha L_{AM}$. Empirically, a suitable choice of $\alpha$ gives the best performance (Table 4).
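A sketch of the attention branch and the combined objective; the 64-channel input width, the default $\alpha = 0.1$, and the use of binary cross-entropy for Eq. (13) are our assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

# attention branch after the penultimate layer: one-channel map in (0, 1)
attention_branch = nn.Sequential(nn.Conv2d(64, 1, kernel_size=1), nn.Sigmoid())

def total_loss(density_est, density_gt, att_est, att_gt, alpha=0.1):
    """L = L_MSE + alpha * L_AM; alpha balances the two objectives."""
    return (F.mse_loss(density_est, density_gt)
            + alpha * F.binary_cross_entropy(att_est, att_gt))
```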
Density Map Size. We change the number of upsampling layers to evaluate the impact of different density map sizes. By adding 1 – 4 upsampling layers, we can obtain density maps of 1/8 the input size up to the full input size. Table 4 shows that estimating density maps of the same size as the input benefits performance.
Table 4: Effect of learning objectives and density map sizes on ShanghaiTech PartA.

| Learning Objectives | MAE | RMSE |
|---|---|---|
| Baseline | 63.1 | 95.4 |
| +AM (window-based) | 66.8 | 99.3 |
| +AM (threshold-based) | 65.8 | 100.6 |
| +AM | 61.8 | 96.9 |
| +AM | 65.8 | 100.6 |
| +AM | 60.1 | 95.5 |
| 1/8 size | 67.2 | 101.8 |
| 1/4 size | 64.7 | 101.4 |
| 1/2 size | 63.8 | 95.0 |
| full size | 60.1 | 95.5 |
6 Comparisons with State-of-the-art
We evaluate our final model with other state-of-the-art methods on three mainly used datasets.
The ShanghaiTech dataset is introduced by Zhang et al. (2016) and consists of PartA and PartB. PartA has 300 training images and 182 testing images with relatively high density; PartB has 400 training images and 316 testing images with relatively low density. As Table 5 shows, our strong baseline decreases MAE by 2.6% on PartA and RMSE by 0.8% on PartB.
Table 5: Comparison with state-of-the-art methods on ShanghaiTech.

| Models | PartA MAE | PartA RMSE | PartB MAE | PartB RMSE |
|---|---|---|---|---|
| Li et al. (2018) | 68.2 | 115.0 | 10.6 | 16.0 |
| Cao et al. (2018) | 67.0 | 104.5 | 8.4 | 13.6 |
| Jiang et al. (2019) | 64.2 | 109.1 | 8.2 | 12.8 |
| Liu et al. (2019b) | 63.2 | 98.9 | 7.7 | 12.9 |
| Liu et al. (2019c) | 62.3 | 100.0 | 7.8 | 12.2 |
| Liu et al. (2019a) | 65.1 | 106.7 | 8.4 | 14.1 |
| Shi et al. (2019) | 62.4 | 102.0 | 7.6 | 11.8 |
| Wang et al. (2019) | 64.8 | 107.5 | 7.6 | 13.0 |
| Chen et al. (2019) | 61.7 | 99.5 | 9.4 | 14.4 |
| Wan et al. (2019) | 63.1 | 96.2 | 8.7 | 13.6 |
| Ours | 60.1 | 95.5 | 7.9 | 11.7 |
The UCF-QNRF dataset is introduced by Idrees et al. (2018). It has 1201 training images and 334 testing images with high resolution. Performance on this dataset is reported in Table 6; our strong baseline decreases MAE by 2.7%.
The UCF_CC_50 dataset is introduced by Idrees et al. (2013) and includes 50 crowd images with extremely high density. We follow the standard 5-fold cross-validation protocol. As Table 6 shows, our baseline decreases MAE by 4.0%.
Table 6: Comparison with state-of-the-art methods on UCF-QNRF and UCF_CC_50.

| Models | UCF-QNRF MAE | UCF-QNRF RMSE | UCF_CC_50 MAE | UCF_CC_50 RMSE |
|---|---|---|---|---|
| Li et al. (2018) | - | - | 266.1 | 397.5 |
| Cao et al. (2018) | - | - | 258.4 | 334.9 |
| Jiang et al. (2019) | 113.0 | 188.0 | 249.4 | 354.5 |
| Liu et al. (2019b) | - | - | 257.1 | 363.5 |
| Liu et al. (2019c) | 107.0 | 183.0 | 212.2 | 243.7 |
| Liu et al. (2019a) | 116.0 | 195.0 | - | - |
| Shi et al. (2019) | - | - | 241.7 | 320.7 |
| Wang et al. (2019) | 102.0 | 171.4 | 214.2 | 318.2 |
| Chen et al. (2019) | - | - | 259.2 | 335.9 |
| Wan et al. (2019) | - | - | 355.0 | 560.2 |
| Ours | 99.2 | 179.1 | 203.7 | 283.1 |
7 Unsupervised People Localization
7.1 Point Set Construction
First, we use the density maps to construct a point set. Each density map is multiplied by an expansion factor $\gamma$ and rounded, so each pixel holds an integer indicating its frequency. In our experiments, $\gamma$ is set to 500. We sample the coordinate of each pixel according to its frequency, as Figure 5 shows, and thereby construct a point set.
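A NumPy sketch of this construction follows:

```python
import numpy as np

def build_point_set(density, gamma=500):
    """Turn a density map into a point set: each pixel's (x, y) coordinate
    is repeated round(gamma * value) times, so point mass tracks density."""
    freq = np.rint(density * gamma).astype(int)
    ys, xs = np.nonzero(freq)
    return np.repeat(np.stack([xs, ys], axis=1), freq[ys, xs], axis=0)
```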
7.2 KMeans
The estimated people count $C$ is calculated by summing all pixel values of a given density map, which means there are about $C$ people in it. Consequently, it is natural to use $\mathrm{round}(C)$ as the number of clusters. We use the KMeans clustering algorithm as a baseline to locate heads in the sampled point set; a sketch follows, and typical clustering results are shown in the second column of Figure 6.
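A minimal scikit-learn sketch of this baseline, reusing `build_point_set` from Section 7.1; `n_init=4` is an illustrative setting:

```python
from sklearn.cluster import KMeans

def locate_heads_kmeans(density, gamma=500):
    """Baseline: one KMeans over the whole point set with
    k = round(integral of the density map) cluster centers."""
    points = build_point_set(density, gamma)
    k = max(1, int(round(density.sum())))
    k = min(k, len(points))  # cannot have more centers than points
    return KMeans(n_clusters=k, n_init=4).fit(points).cluster_centers_
```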
Table 7: Average precision of people localization for different window sizes $s$.

| Datasets | $s$ | KMeans | Isolated KMeans |
|---|---|---|---|
| ShanghaiTech PartA | 40 | 83.8% | 88.3% |
| | 20 | 75.1% | 81.6% |
| | 10 | 36.9% | 44.2% |
| ShanghaiTech PartB | 40 | 90.7% | 94.3% |
| | 20 | 81.3% | 87.0% |
| | 10 | 54.5% | 63.9% |
| UCF-QNRF | 40 | 75.1% | 78.0% |
| | 20 | 50.7% | 54.6% |
| | 10 | 23.9% | 26.1% |
| UCF_CC_50 | 40 | 70.3% | 73.9% |
| | 20 | 63.4% | 66.8% |
| | 10 | 30.7% | 35.9% |
Localization precision is calculated in a similar way to the evaluation metric used in MS-COCO (Lin et al., 2014): 1) all centers are ranked by the number of points in their corresponding cluster; 2) from the center with the most points to the one with the fewest, each center is matched to the nearest person in the ground truth; 3) we calculate the overlap of the two $s \times s$ windows centered at the matched pair of points, for $s \in \{10, 20, 40\}$. Each labeled person in the ground truth can be matched only once. The Average Precision (AP) of the KMeans-based localization algorithm is reported in Table 7.
7.3 Isolated KMeans
Although plain KMeans matches the number of centers to the people count for the whole map, in some regions the number of centers does not match the sum of the local pixel values. To make the cluster centers more consistent with the crowd distribution, we first divide the whole density map into several subregions and then use KMeans to locate the heads in each subregion separately. In this way, the people count matches the cluster number both globally and locally.
To be specific, we use DBSCAN to divide the point set into several clusters, and the points in each cluster are regarded as one subregion. In each subregion, the local people count, i.e. the local cluster number, is calculated by summing the pixel values in that region, and KMeans is then used to locate the heads. This procedure is summarized in Algorithm 1 and sketched below. The AP of isolated KMeans, calculated in the same way as before, is also reported in Table 7; localization accuracy improves for all window sizes. Typical results of the divided subregions, their centers, and the localization results of isolated KMeans are illustrated in Figure 6.
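Below is a minimal scikit-learn sketch consistent with this description; `eps` and `min_samples` are illustrative DBSCAN settings (assumptions, not the paper's values), and `build_point_set` is the sampler from Section 7.1:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def isolated_kmeans(density, gamma=500, eps=3.0, min_samples=5):
    """Divide the point set into subregions with DBSCAN, then run a local
    KMeans in each with k equal to that subregion's people count."""
    points = build_point_set(density, gamma)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    centers = []
    for region in set(labels) - {-1}:  # label -1 marks DBSCAN noise points
        pts = points[labels == region]
        # each sampled point carries ~1/gamma of mass, so the local count
        # is approximately the number of points divided by gamma
        k = max(1, int(round(len(pts) / gamma)))
        k = min(k, len(pts))
        centers.append(KMeans(n_clusters=k, n_init=4).fit(pts).cluster_centers_)
    return np.vstack(centers) if centers else np.empty((0, 2))
```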
8 Conclusion
In this paper, we collect and evaluate a series of backbones and training tricks for training a crowd counting network. By selecting the best backbone and applying the effective training tricks together, we construct an efficient and accurate baseline that improves MAE and RMSE significantly on three mainly used datasets. We also propose an unsupervised people localization method named isolated KMeans. This clustering algorithm works on the point set constructed from the density maps, which eliminates the time spent training detection networks, and it can be integrated with any existing counting method.
References
- Cao et al. [2018] Xinkun Cao, Zhipeng Wang, Yanyun Zhao, and Fei Su. Scale aggregation network for accurate and efficient crowd counting. In Proceedings of the European Conference on Computer Vision (ECCV), pages 734–750, 2018.
- Chen et al. [2019] Xinya Chen, Yanrui Bin, Nong Sang, and Changxin Gao. Scale pyramid network for crowd counting. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1941–1950. IEEE, 2019.
- Dai et al. [2019] Feng Dai, Hao Liu, Yike Ma, Juan Cao, Qiang Zhao, and Yongdong Zhang. Dense scale network for crowd counting. arXiv preprint arXiv:1906.09707, 2019.
- He et al. [2019] Tong He, Zhi Zhang, Hang Zhang, Zhongyue Zhang, Junyuan Xie, and Mu Li. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 558–567, 2019.
- Hu et al. [2018] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
- Idrees et al. [2013] Haroon Idrees, Imran Saleemi, Cody Seibert, and Mubarak Shah. Multi-source multi-scale counting in extremely dense crowd images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2547–2554, 2013.
- Idrees et al. [2018] Haroon Idrees, Muhmmad Tayyab, Kishan Athrey, Dong Zhang, Somaya Al-Maadeed, Nasir Rajpoot, and Mubarak Shah. Composition loss for counting, density map estimation and localization in dense crowds. In Proceedings of the European Conference on Computer Vision (ECCV), pages 532–546, 2018.
- Jiang et al. [2019] Xiaolong Jiang, Zehao Xiao, Baochang Zhang, Xiantong Zhen, Xianbin Cao, David Doermann, and Ling Shao. Crowd counting and density estimation by trellis encoder-decoder networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6133–6142, 2019.
- Li et al. [2018] Yuhong Li, Xiaofan Zhang, and Deming Chen. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1091–1100, 2018.
- Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
- Liu et al. [2019a] Chenchen Liu, Xinyu Weng, and Yadong Mu. Recurrent attentive zooming for joint crowd counting and precise localization. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1217–1226. IEEE, 2019.
- Liu et al. [2019b] Ning Liu, Yongchao Long, Changqing Zou, Qun Niu, Li Pan, and Hefeng Wu. Adcrowdnet: An attention-injective deformable convolutional network for crowd understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3225–3234, 2019.
- Liu et al. [2019c] Weizhe Liu, Mathieu Salzmann, and Pascal Fua. Context-aware crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5099–5108, 2019.
- Liu et al. [2019d] Yuting Liu, Miaojing Shi, Qijun Zhao, and Xiaofang Wang. Point in, box out: Beyond counting persons in crowds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6469–6478, 2019.
- Ramachandran et al. [2017] Prajit Ramachandran, Barret Zoph, and Quoc V Le. Swish: a self-gated activation function. arXiv preprint arXiv:1710.05941, 7, 2017.
- Shi et al. [2019] Miaojing Shi, Zhaohui Yang, Chao Xu, and Qijun Chen. Revisiting perspective information for efficient crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7279–7288, 2019.
- Wan et al. [2019] Jia Wan, Wenhan Luo, Baoyuan Wu, Antoni B Chan, and Wei Liu. Residual regression with semantic prior for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4036–4045, 2019.
- Wang and Breckon [2019] Qian Wang and Toby P Breckon. Segmentation guided attention network for crowd counting via curriculum learning. arXiv preprint arXiv:1911.07990, 2019.
- Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
- Wang et al. [2019] Qi Wang, Junyu Gao, Wei Lin, and Yuan Yuan. Learning from synthetic data for crowd counting in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8198–8207, 2019.
- Zhang et al. [2016] Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 589–597, 2016.