Crowd Localization from Gaussian Mixture Scoped Knowledge and Scoped Teacher

Juncheng Wang, Junyu Gao,  Yuan Yuan, 
and Qi Wang
J. Wang is with the School of Software, and with the School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi’an 710072, P. R. China. E-mail: [email protected]. Gao, Y. Yuan and Q. Wang are with the School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi’an 710072, P. R. China. E-mails: [email protected]; [email protected]; [email protected]. Wang and J. Gao are the corresponding authors.
Abstract

Crowd localization aims to predict the head position of each instance in crowd scenarios. Because pedestrians stand at varying distances from the camera, the scales of instances within a single image differ tremendously, a phenomenon we call intrinsic scale shift. Intrinsic scale shift is one of the most essential issues in crowd localization because it is ubiquitous in crowd scenes and makes the scale distribution chaotic.

To this end, this paper focuses on taming the chaos in the scale distribution caused by intrinsic scale shift. We propose the Gaussian Mixture Scope (GMS) to regularize the chaotic scale distribution. Concretely, GMS fits a Gaussian mixture distribution to the scale distribution and decouples the mixture model into normal sub-distributions, which regularizes the chaos within each sub-distribution. An alignment step is then introduced to regularize the chaos among sub-distributions. However, although GMS is effective in regularizing the data distribution, it amounts to removing the hard samples from the training set, which causes overfitting. We attribute this to a blockage in transferring the latent knowledge exploited by GMS from the data to the model. Therefore, we propose a Scoped Teacher that acts as a bridge for knowledge transfer. Consistency regularization is also introduced to implement the knowledge transfer. To that end, further constraints are imposed on the Scoped Teacher to obtain feature consistency between the teacher and student ends.

We implement the proposed GMS and Scoped Teacher on four mainstream crowd localization datasets, and extensive experiments demonstrate the superiority of our work. Moreover, compared with existing crowd locators, our work achieves state-of-the-art F1-measure on all four datasets.

Index Terms:
Congested Scenes Perception, Crowd Localization, Intrinsic Scale Shift.

I INTRODUCTION

Crowd analysis is a popular application in the computer vision community and has achieved great success, especially in crowd counting [1, 2, 3, 4, 5, 6, 7]. Crowd counting is a fundamental task that estimates the total number of pedestrians. Mainstream pipelines produce the predicted count by directly regressing a scalar [8] or by integrating a density distribution [3]. These methods cannot yield an accurate location for each instance in a crowd scene, especially in congested regions. Recently, some researchers have focused on crowd instance localization [9, 10, 11, 12], which aims to locate the center of each person's head. Its instance-level predictions provide more detailed information than traditional counting algorithms and aid high-level crowd analysis tasks, such as crowd tracking [13] and group detection [14], more effectively.

Figure 1: Crowd scenes with intrinsic scale shift. To facilitate visualization, we transfer the image into the gray mode and distinguish boxes with different scales in colors.

However, crowd localization remains challenging because of its instance-level objective. In crowd scenes [11, 10, 15], instances appear at distinctly different scales within an image; this inconsistency of representation is called intrinsic scale shift. Intrinsic scale shift, which stems from the different distances between instances and the camera, is an essential issue in crowd localization because it is ubiquitous in crowd scenarios. Fig. 1 depicts some typical examples, in which boxes of different scales are annotated in different colors. Intrinsic scale shift makes it hard for a crowd locator to capture instances of varying scales. More precisely, it is arduous for crowd locators to converge on data that are not independent and identically distributed (i.i.d.), and the scale distribution of data with intrinsic scale shift can be regarded as such a chaotic distribution. Thus, it is imperative to address intrinsic scale shift in crowd localization.

In crowd counting, a related but more mature field than crowd localization, intrinsic scale shift has been attacked along two main lines. The first is the model perspective: designing a scale-aware model tackles intrinsic scale shift to a certain extent. SAS Net [1] proposes a fusion strategy among feature maps of different resolutions to aggregate different scales. Although scale-aware models yield some improvement, a manually designed architecture struggles to capture the scale information found in the wild. The second line therefore takes the data perspective and aligns the intrinsic scale shift. SD Net [16] aligns scale shift among image patches divided in a regular order. However, dividing images in this way ignores the scale variance within each patch. Moreover, patch-level division distorts the semantic information in the marginal regions of the patches. In crowd localization, this semantic distortion of instances lying in the marginal regions degrades performance. To this end, the crowd locator RAZ Net [10] leverages a recurrent method that finds the region with the smallest scales and assigns a scale factor to each recurrent layer. Nevertheless, it is challenging to find the smallest-scale region without missing other comparable regions.

This paper aims to tackle intrinsic scale shift in crowd localization via data regularization and knowledge transfer. From the data perspective, intrinsic scale shift causes chaos in the scale distribution of crowd scenes. Thus, we propose a Gaussian Mixture Scope (GMS), which aligns the chaotic scale distribution and constrains the data toward a normalized form. Specifically, a Gaussian mixture distribution is fitted to the scale distribution. By decoupling the mixture model, the distribution is separated into normal sub-distributions, so the chaos within the scale distribution is mapped to a shift among the sub-distributions. To handle this shift, we align scales among the sub-distributions, and we geometrize the comparison of probability distributions to do so. Concretely, when fitting the Gaussian mixture, we include a spatial feature as one of the observed values, which gives the sub-distributions spatial compactness. This compactness makes it possible to treat sub-distributions as image patches and to align the shift via image interpolation.

Although the geometrized constraint gives the sub-distributions spatial compactness, decoupling the scale distribution raises semantic issues. Since the decision boundary among sub-distributions is adaptive, decoupling amounts to adaptively cutting the image. Under this cutting strategy, aligning the shift via interpolation with distinct scale factors distorts the semantics. To this end, we propose a sub-distribution re-aggregation trick. During shift alignment, each image is kept whole and fed into the crowd locator, and windows are then cropped from the result according to the corresponding sub-distribution. As a result, the locator works on undistorted images, which has less adverse influence than working on distorted ones.

With GMS aligning the scale shift and sub-distribution re-aggregation alleviating the semantic distortion, the chaos in the data distribution is regularized. However, directly applying GMS in the training phase to regularize the training data removes the hard samples. The crowd locator then cannot actively learn the knowledge but only passively receives it, which causes overfitting on the training set. We argue that regularizing data with GMS can be treated as exploiting latent knowledge. To transfer this latent knowledge from the regularized data to the model, we propose a Scoped Teacher that acts as a bridge for knowledge transfer. The Scoped Teacher introduces a new paradigm compared with conventional learning from manually annotated ground truth in fully-supervised crowd localization. In training, the GMS-regularized images are fed into the Scoped Teacher to exploit latent knowledge that is hard to derive from ground-truth learning. To transfer the knowledge, a consistency loss is applied. In this way, the student model gradually learns the features exploited by the Scoped Teacher and converges better.

In a nutshell, our contributions are four-fold:

  • We propose to tackle crowd localization from the perspective of scale shift, and provide a novel scale distribution alignment that geometrizes the issue and is implemented via image interpolation.

  • We present a Gaussian Mixture Scope (GMS) that aligns scales via scale distribution decoupling and sub-distribution alignment. Moreover, we propose a sub-distribution re-aggregation trick to alleviate boundary semantic distortion during alignment.

  • We design a Scoped Teacher to transfer latent knowledge, which also addresses the overfitting caused by applying GMS directly in training. Moreover, the Scoped Teacher is a new paradigm in fully-supervised crowd localization.

  • Quantitative results demonstrate that our work achieves state-of-the-art performance on four mainstream crowd localization datasets.

II RELATED WORKS

In this section, we briefly review works related to our method. First, since intrinsic scale shift also exists in crowd counting and has been studied by that community, it is useful to review how crowd counting handles it. Second, we introduce crowd localization works. Finally, to distinguish our method from other teacher-student models, we analyze some representative works that adopt a teacher-student architecture.

II-A Crowd Counting

As mentioned above, the counting community attacks intrinsic scale shift along two main lines. From the model perspective, a multitude of works [17, 18, 19, 20, 21, 22, 23] deal with intrinsic scale shift via multi-feature fusion. Others [24, 2, 25, 26, 27] trace intrinsic scale shift back to its cause, namely perspective imaging, and use a predicted perspective map as part of the training strategy. Although perspective-related works achieve some improvement, we argue that the intrinsic scale shift itself has not been aligned. To this end, Auto Scale [28] and L2SM [29] propose to scale image patches according to density level. [30, 31, 32] also feed patches with distinct density levels into CNNs with different receptive fields. [7] presents crowd flow to enhance counting performance with location flow. However, density level cannot represent instance scale. Therefore, SD Net [16] introduces instance scale and uses it to align scale shift. But SD Net fails to preserve semantic information during processing and ignores the intrinsic scale shift within the divided patches.

II-B Crowd Localization

Crowd localization aims to locate the precise position of each head shown in the image. The most natural idea for localization is object detection [33, 34, 35]. TinyFaces [36] uses a detection-based framework to locate tiny faces by analyzing the impact of scale, contextual semantic information and image resolution. Following TinyFaces [36], some researchers [15, 37, 38, 39] extend this line of work to tackle intrinsic scale shift. However, owing to the limitations of the detection structure, detection-based methods still perform poorly in extremely congested scenarios. Thus, some researchers have turned to regression-based crowd locators. RD Net [40] leverages depth information to generate a spatially aware supervision map, but depth information is unavailable in mainstream crowd localization datasets. Thus, [6] proposes to use a fine-grained density map for crowd localization. BL [12] proposes a location-aware loss function to locate crowds, but it fails to address intrinsic scale shift. [41, 42, 43, 44, 45] use instance segmentation to locate crowds. Notably, instance segmentation locators introduce box annotations into regression, so instance scale information can be estimated; our method therefore follows this baseline. In addition, some other works concentrate on intrinsic scale shift in crowd localization. Auto Scale [28] proposes to estimate a density region and learn to zoom into it. Similarly, RAZ Net [10] proposes a strategy for selecting the density region. These zooming strategies cannot cater to variance across multiple regions.

II-C Teacher-Student Model

The teacher-student model was originally proposed for transfer learning. [46, 47, 48, 49] use the teacher-student model in Knowledge Distillation (KD). Our Scoped Teacher is inspired by KD, in that the teacher model bridges the data and the student model. However, KD tends to use a larger teacher model to exploit latent knowledge, whereas our Scoped Teacher shares the same architecture as the student model and receives images processed by GMS, so the latent knowledge comes from GMS rather than from model representation capacity. In Semi-Supervised Learning (SSL), some researchers also introduce a teacher model. [50, 51, 52, 53] use a teacher model for Consistency Regularization. [54] introduces a momentum network to predict pseudo labels for unannotated images, which is in effect a teacher-student model. The teacher models used in SSL predict pseudo labels for unannotated samples, which tend to carry coarse knowledge. Although the proposed teacher model also uses Consistency Regularization, our target is to transfer fine-grained knowledge rather than the coarse knowledge that the student crowd locator has already learned from annotations.

III METHODOLOGY

Figure 2: Schematic illustration of our proposed framework. The pipeline is divided into three branches. The left branch is the proposed GMS, which processes the image before it is fed into the crowd locator. The upper stream is the student end, which receives the original images. The lower stream is the teacher end, which receives the images processed by GMS. Finally, a consistency regularization is applied between the two ends.

Overview. This paper aims to tackle intrinsic scale shift in crowd localization. As shown in Fig. 2, we propose a Gaussian Mixture Scope (GMS) that regularizes crowd images to exploit latent scale knowledge. The regularized and original images are then fed into the proposed Scoped Teacher and the student model, respectively, to predict localization. Next, sub-distribution re-aggregation recomposes the scoped prediction. Finally, a consistency loss between the predictions of the Scoped Teacher and the student model is introduced to transfer knowledge. Section III-A reviews the instance segmentation method that serves as our baseline. Section III-B covers the scale alignment process, namely the proposed GMS and the sub-distribution re-aggregation trick. Section III-C describes the Scoped Teacher and knowledge transfer. Section III-D summarizes our training objective.

III-A Instance-Segmentation Crowd Locator

The popular density map regression method in crowd counting cannot provide precise spatial information. Therefore, some researchers [41, 42] segment instance heads to perform crowd localization. Concretely, they use a fixed, global threshold to convert the regressed confidence map, activated by a sigmoid function, into a binary map, which is not robust. IIM [43] therefore proposes an additional trainable pixel-level threshold map to binarize the confidence map.

Formally, given an image $\mathcal{I}_{ori}\in\mathbb{R}^{3\times H\times W}$, in which the subscript $ori$ denotes the original resolution, a confidence map $\mathcal{F}_{ori}\in\mathbb{R}^{1\times H\times W}$ is predicted, see Eq. (1),

$$\mathbf{0}^{1\times H\times W}\leq\mathcal{F}_{ori}\leq\mathbf{1}^{1\times H\times W}, \tag{1}$$

where $\mathbf{0}$ and $\mathbf{1}$ denote tensors filled with 0 and 1, respectively. For the fixed-threshold works, the segmented binary map $\mathcal{B}_{ori}^{fix}\in\mathbb{R}^{1\times H\times W}$ is obtained through Eq. (2):

$$\mathcal{B}_{ori}^{fix}(h,w)=\begin{cases}1, & \text{if } \mathcal{F}_{ori}(h,w)\geq\varepsilon\\ 0, & \text{otherwise}\end{cases}, \tag{2}$$

where $\varepsilon$ is a fixed threshold and $h,w$ are the pixel coordinates. As for IIM, the binary map $\mathcal{B}_{ori}^{apt}\in\mathbb{R}^{1\times H\times W}$ is obtained through a trainable threshold map $\mathcal{T}\in\mathbb{R}^{1\times H\times W}$ as in Eq. (3):

$$\mathcal{B}_{ori}^{apt}(h,w)=\begin{cases}1, & \text{if } \mathcal{F}_{ori}(h,w)\geq\mathcal{T}(h,w)\\ 0, & \text{otherwise}\end{cases}. \tag{3}$$
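The difference between Eq. (2) and Eq. (3) can be illustrated with a minimal sketch. It is not the implementation of [43]: the function names and toy values are hypothetical, and the maps are plain nested lists rather than tensors.

```python
from typing import List

Map2D = List[List[float]]


def binarize_fixed(conf: Map2D, eps: float) -> List[List[int]]:
    """Eq. (2): one global threshold eps shared by every pixel."""
    return [[1 if value >= eps else 0 for value in row] for row in conf]


def binarize_adaptive(conf: Map2D, thresh: Map2D) -> List[List[int]]:
    """Eq. (3): a separate threshold T(h, w) for every pixel."""
    return [
        [1 if value >= t else 0 for value, t in zip(conf_row, thresh_row)]
        for conf_row, thresh_row in zip(conf, thresh)
    ]


# Toy 2x2 confidence map and threshold map (illustrative values only).
confidence = [[0.9, 0.4], [0.55, 0.1]]
thresholds = [[0.5, 0.3], [0.6, 0.2]]
fixed = binarize_fixed(confidence, 0.5)
adaptive = binarize_adaptive(confidence, thresholds)
```

In this toy case the two maps disagree at pixels (0, 1) and (1, 0), which is exactly where the per-pixel threshold departs from the global one.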

The binarization in Eq. (3) is clearly non-differentiable. Thus, [43] relaxes it to provide a gradient, which can be described as Eq. (4):

$$\mathcal{T}_{n+1}=\mathcal{T}_{n}+\alpha\frac{\partial\mathcal{L}_{seg}}{\partial\mathcal{B}}, \tag{4}$$

where $\alpha$ is the learning rate and $\mathcal{L}_{seg}$ is formulated as Eq. (6).

With the adaptive threshold map $\mathcal{T}$, a robust binary map $\mathcal{B}_{ori}^{apt}$ is derived. The training objective is formulated as Eq. (6):

\begin{align}
\mathcal{L}_{seg}= {} & \frac{1}{H\cdot W}\sum_{h=1}^{H}\sum_{w=1}^{W}\Big(\left\|\mathcal{F}_{ori}(h,w)-\widehat{\mathcal{B}}(h,w)\right\|^{2}+ \tag{5}\\
& \left\|\mathcal{B}^{apt}_{ori}(h,w)-\widehat{\mathcal{B}}(h,w)\right\|^{1}\Big), \tag{6}
\end{align}

where $\widehat{\mathcal{B}}\in\mathbb{R}^{1\times H\times W}$ is the ground-truth binary map of image $\mathcal{I}_{ori}$. In this way, a precise binary map is derived, so we adopt IIM as our baseline. For brevity, we omit the superscript $apt$ in $\mathcal{B}_{ori}^{apt}$ in the following.
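As a reading aid, the per-pixel objective of Eq. (5)-(6) can be written out as a small sketch: a squared error on the confidence map plus an absolute error on the binary map, averaged over all pixels. The function name and toy inputs are hypothetical, and the sketch ignores how gradients reach the threshold map.

```python
from typing import List

Map2D = List[List[float]]


def seg_loss(conf: Map2D, binary: Map2D, target: Map2D) -> float:
    """Mean over pixels of (F - B_hat)^2 + |B - B_hat|, following Eq. (5)-(6)."""
    height, width = len(conf), len(conf[0])
    total = 0.0
    for h in range(height):
        for w in range(width):
            total += (conf[h][w] - target[h][w]) ** 2   # L2 term on the confidence map
            total += abs(binary[h][w] - target[h][w])   # L1 term on the binary map
    return total / (height * width)


# Toy 2x2 example (illustrative values only).
conf_map = [[0.5, 0.0], [1.0, 0.5]]
binary_map = [[1.0, 0.0], [1.0, 0.0]]
gt_map = [[1.0, 0.0], [1.0, 1.0]]
loss = seg_loss(conf_map, binary_map, gt_map)
```

The consistency loss of Eq. (14) has the same shape, with the teacher's binary map playing the role of the target instead of the ground truth.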

III-B Gaussian Mixture Scope

In instance segmentation crowd localization, the locator derives its supervision signal from a binary map in which head areas are annotated as foreground, so scale information is only implicit. As a result, the locator is fragile and sensitive to instance scale shift, and its performance is limited because it is hard to capture large and tiny instances simultaneously. We attribute this issue to the chaotically distributed scales. Since a deep model is trained via Empirical Risk Minimization (ERM), it is arduous for a crowd locator to converge on data that do not satisfy the independent and identically distributed (i.i.d.) assumption. In summary, regularizing the chaotic scale distribution is the key to tackling intrinsic scale shift.

To this end, we propose to decouple the chaotic scale distribution into several regular sub-distributions, so that the chaos within the scale distribution is converted into a distribution shift among sub-distributions. Additionally, we constrain the spatial feature to be correlated with the scale distribution during decoupling. In this way, the sub-distributions are spatially compact, and their alignment can be implemented via image interpolation.

Formally, given an image $\mathcal{I}_{ori}$ with $N$ pedestrians, in which the subscript $ori$ denotes the original resolution, the corresponding scale distribution $\mathcal{S}_{ori}$ is given by Eq. (7):

$$\mathcal{S}_{ori}=\frac{1}{N}\sum_{i=1}^{N}\delta(\alpha_{i}), \tag{7}$$

where $\alpha_{i}$ is the scale of the $i^{th}$ instance and $\delta$ denotes the one-dimensional Dirac function. We fit a Gaussian mixture distribution to $\mathcal{S}_{ori}$ as in Eq. (8):

$$\mathcal{S}_{ori}\sim\text{Pr}(\alpha,v|\theta)=\sum_{c=1}^{C}\pi_{c}\mathcal{N}(\alpha,v|\theta_{c}), \tag{8}$$

where the mixture distribution is composed of $C$ Gaussian sub-distributions $\mathcal{N}(\cdot)$, each with parameters $\theta_{c}$ (the mean and variance of the Gaussian) and mixing probability $\pi_{c}$, and $v$ is the vertical location of the instance. The mixture model is initialized and updated by Expectation Maximization (EM), whose objective function is:

$$\Theta=\arg\max_{\Theta}\sum_{n}\sum_{c}\lambda_{n,c}\ln\frac{\pi_{c}\mathcal{N}(\alpha_{n},v_{n}|\theta_{c})}{\lambda_{n,c}}, \tag{9}$$

in which $\Theta$ is the set $\{\pi_{c},\theta_{c}|c=1,...,C\}$ and $\lambda$ is defined as Eq. (10):

$$\lambda_{n,c}=\frac{\pi_{c}\mathcal{N}(\alpha_{n},v_{n}|\theta_{c})}{\sum_{c'}\pi_{c'}\mathcal{N}(\alpha_{n},v_{n}|\theta_{c'})}. \tag{10}$$

As mentioned above, to facilitate alignment, the mixture model should be correlated with a spatial feature during decoupling. We adopt only the vertical spatial feature $v$, which reduces the computational complexity from $\mathcal{O}(n^{2})$ to $\mathcal{O}(n)$. The horizontal feature is redundant for representing scale, as we demonstrate in Section IV-C5. In practice, fine-grained scale information is unavailable in all existing datasets, so the annotated box area $\alpha$ is adopted as the observed value for the mixture distribution.
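The EM fit of Eq. (8)-(10) over the observations $(\alpha_n, v_n)$ can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: each component uses a diagonal covariance, the initial means are supplied by the caller, and all function names and toy values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (box area alpha, vertical position v)


def gauss_diag(x: Point, mean: Point, var: Point) -> float:
    """2-D Gaussian density with diagonal covariance (a simplifying assumption)."""
    density = 1.0
    for xi, mi, vi in zip(x, mean, var):
        density *= math.exp(-((xi - mi) ** 2) / (2.0 * vi)) / math.sqrt(2.0 * math.pi * vi)
    return density


def e_step(data, pis, means, variances) -> List[List[float]]:
    """Responsibilities lambda[n][c] of Eq. (10)."""
    resp = []
    for x in data:
        weighted = [p * gauss_diag(x, m, v) for p, m, v in zip(pis, means, variances)]
        norm = sum(weighted) or 1e-300  # avoid division by zero on underflow
        resp.append([value / norm for value in weighted])
    return resp


def m_step(data, resp, min_var: float = 1e-6):
    """Closed-form update of pi_c and theta_c for fixed responsibilities."""
    n, num_comp = len(data), len(resp[0])
    pis, means, variances = [], [], []
    for c in range(num_comp):
        weight = sum(resp[i][c] for i in range(n)) or 1e-300
        mean = tuple(sum(resp[i][c] * data[i][d] for i in range(n)) / weight for d in range(2))
        var = tuple(
            max(sum(resp[i][c] * (data[i][d] - mean[d]) ** 2 for i in range(n)) / weight, min_var)
            for d in range(2)
        )
        pis.append(weight / n)
        means.append(mean)
        variances.append(var)
    return pis, means, variances


def fit_gmm(data: Sequence[Point], init_means: Sequence[Point], n_iter: int = 50):
    """Alternate E and M steps, then assign each instance to its likeliest component."""
    num_comp = len(init_means)
    pis, means = [1.0 / num_comp] * num_comp, list(init_means)
    overall = []
    for d in range(2):  # start every component from the overall data variance
        mu = sum(x[d] for x in data) / len(data)
        overall.append(max(sum((x[d] - mu) ** 2 for x in data) / len(data), 1e-6))
    variances = [tuple(overall)] * num_comp
    for _ in range(n_iter):
        pis, means, variances = m_step(data, e_step(data, pis, means, variances))
    resp = e_step(data, pis, means, variances)
    labels = [max(range(num_comp), key=lambda c: r[c]) for r in resp]
    return pis, means, variances, labels


# Toy scene: small heads near the top of the image, large heads near the bottom.
data = [(100.0, 10.0), (120.0, 15.0), (110.0, 12.0), (900.0, 200.0), (1000.0, 220.0), (950.0, 210.0)]
pis, means, variances, labels = fit_gmm(data, init_means=[(100.0, 10.0), (1000.0, 220.0)])
```

The hard labels returned at the end correspond to the decoupling step: each instance is attached to one sub-distribution, which is later treated as a patch.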

After decoupling the mixture distribution, the chaotic scale distribution is decomposed into $C$ normal distributions, which appear easier for the model to converge on. Moreover, because the vertical feature is constrained during fitting and decoupling, the sub-distributions are spatially compact. Hence, $C$ patches are derived, each of which has a scale distribution given by one normal sub-distribution in Eq. (8). The remaining issue in intrinsic scale shift is thus to align the scale shift among sub-distributions. We introduce prior knowledge in the form of an optimal scale $\alpha_{0}$, which serves as the landmark for aligning the scale shift among sub-distributions. For each patch $p_{c}$, the aligned patch $\widehat{p}_{c}$ is derived via Eq. (11) with an interpolation:

$$\widehat{p}_{c}=\mathit{Inter}\left(p_{c},\frac{\sum_{i=0}^{N_{c}}\alpha_{i}}{\alpha_{0}\cdot(N_{c}-1)}\right), \tag{11}$$

where $N_{c}$ is the number of instances in $p_{c}$. Note that, to limit the computational cost, we compromise by using the average scale of each patch. Since decoupling makes the scales within a sub-distribution, namely a patch, compact, the average scale adequately represents the patch.
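The second argument of the interpolation in Eq. (11) can be computed per patch as in the sketch below. It reads the sum as running over the instances assigned to the patch and keeps the normalization exactly as written in Eq. (11); the function name, the fallback for patches with fewer than two instances, and the value chosen for the optimal scale are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def patch_scale_factors(
    areas: Sequence[float], labels: Sequence[int], alpha_opt: float
) -> Dict[int, float]:
    """Per-patch factor sum(alpha_i) / (alpha_opt * (N_c - 1)), as written in Eq. (11)."""
    grouped: Dict[int, List[float]] = {}
    for area, label in zip(areas, labels):
        grouped.setdefault(label, []).append(area)
    factors: Dict[int, float] = {}
    for label, group in grouped.items():
        n_c = len(group)
        if n_c < 2:
            factors[label] = 1.0  # assumption: leave degenerate patches unscaled
        else:
            factors[label] = sum(group) / (alpha_opt * (n_c - 1))
    return factors


# Toy example: box areas and their sub-distribution labels (illustrative values).
box_areas = [100.0, 120.0, 110.0, 900.0, 1000.0, 950.0]
patch_labels = [0, 0, 0, 1, 1, 1]
factors = patch_scale_factors(box_areas, patch_labels, alpha_opt=500.0)
```

Only one factor per patch is produced, which reflects the compromise of representing each patch by its average scale.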

Finally, the scale shift is aligned at two levels, inter-patch and intra-patch. However, the sub-distributions are still discrete. There are two ordinary ways to handle them: one is to splice them directly and pad the smaller ones, and the other is to keep them discrete. Nevertheless, because the patches are interpolated with distinct scale factors during sub-distribution alignment, the junction regions are unavoidably distorted semantically. Moreover, since the decoupling is adaptive, the decision boundary is uncertain, so some instances to be detected may be cut off and distorted. To alleviate the issue, we propose a sub-distribution re-aggregation trick.

Figure 3: Depiction of the sub-distribution re-aggregation pipeline. The transparent windows are geometrized sub-distributions.

Re-Aggregation for Sub-Distributions. In both of the above ways of handling discrete sub-distributions, the uncertain decision boundary of decoupling risks cutting instances off. Given an image or patch with cut-off instances, the locator cannot capture their semantics and detect them. Thus, we argue that the locator must be fed with the whole image. As shown in Fig. 3, let $\mathcal{I}_{scp}$ be the re-aggregated image. To meet this requirement, Eq. (12) must hold:

$$\exists\gamma\in\mathbb{R},\ \mathcal{I}_{scp}\equiv Inter(\mathcal{I}_{ori},\gamma). \tag{12}$$

Therefore, given $C$ patches with $C$ corresponding scale factors $\left\{\alpha_{1},\alpha_{2},\cdots,\alpha_{C}\right\}$ from Eq. (11) (here $\alpha_{i}$ denotes the scale factor of the $i$-th patch), we interpolate $\mathcal{I}_{ori}$ with each $\alpha_{i}$, $C$ times in total, and derive Eq. (13):

$$\left\{\mathcal{I}_{scp}^{i}=Inter(\mathcal{I}_{ori},\alpha_{i})\mid i=1,2,\cdots,C\right\}. \tag{13}$$

Then, the $\left\{\mathcal{I}_{scp}^{i}\mid i=1,2,\cdots,C\right\}$ are fed into the locator, which lets it capture correct semantic information. To obtain the final prediction $\mathcal{B}^{re}_{scp}$ that is spatially aligned with the original image $\mathcal{I}_{ori}$, each prediction $\mathcal{B}^{i}_{scp}$ from $\mathcal{I}_{scp}^{i}$ is re-interpolated via the corresponding factor in $\left\{\frac{1}{\alpha_{1}},\frac{1}{\alpha_{2}},\cdots,\frac{1}{\alpha_{C}}\right\}$, and the ROI is cropped according to the patch $p_{i}$. The final prediction is composed of the cropped ROIs.
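The re-aggregation procedure can be sketched end to end as follows. Several simplifications are assumed: images are 2-D nested lists, a nearest-neighbour resize stands in for the interpolation $Inter(\cdot)$, each patch is a horizontal band given by a hypothetical `(top, bottom)` row range, and `locator` is any callable that maps an image to a prediction of the same size.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]


def resize_nearest(img: Image, out_h: int, out_w: int) -> Image:
    """Nearest-neighbour resize; a stand-in for the interpolation Inter(.)."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[min(in_h - 1, r * in_h // out_h)][min(in_w - 1, c * in_w // out_w)]
         for c in range(out_w)]
        for r in range(out_h)
    ]


def reaggregate(
    image: Image,
    factors: Sequence[float],
    row_bands: Sequence[Tuple[int, int]],
    locator: Callable[[Image], Image],
) -> Image:
    """Predict on the whole image at each scale, then keep each patch's own band."""
    height, width = len(image), len(image[0])
    result = [[0.0] * width for _ in range(height)]
    for factor, (top, bottom) in zip(factors, row_bands):
        new_h = max(1, round(height * factor))
        new_w = max(1, round(width * factor))
        scaled = resize_nearest(image, new_h, new_w)    # Eq. (13): scale the whole image
        pred = locator(scaled)                          # prediction at this scale
        restored = resize_nearest(pred, height, width)  # map back to the original size
        for r in range(top, bottom):                    # crop the ROI of this patch
            result[r] = restored[r][:]
    return result


# Toy check with an identity "locator" on a 4x4 image (illustrative only).
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
merged = reaggregate(image, [2.0, 1.0], [(0, 2), (2, 4)], lambda img: img)
```

Because every scaled copy is a resize of the full image, no instance is cut at a patch boundary before it reaches the locator; the cropping happens only on the predictions.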

III-C Scoped Teacher

In Section III-B, the Gaussian Mixture Scope (GMS) is proposed to regularize the chaotic scale distribution. Given a set of images $\mathcal{I}$ with a chaotic scale distribution, the crowd locator embeds $\mathcal{I}$ into a latent space in which only some instances of certain scales are captured and some backgrounds are mis-embedded. GMS aligns scales, which helps feature embeddings with varying scale representations map to the same latent space. We denote the knowledge predicted by a model trained on ground truth via ERM as $\mathcal{K}_{gt}$. When GMS processes $\mathcal{I}$, the outliers not covered by $\mathcal{K}_{gt}$ are mapped into the same latent space and form $\mathcal{K}_{GMS}$. In the training phase, GMS exploits the latent $\mathcal{K}_{GMS}$ so that the crowd locator captures the outliers better. However, although GMS provides the additional $\mathcal{K}_{GMS}$ to the crowd locator, the locator only passively receives $\mathcal{K}_{GMS}$ rather than actively learning it. The training phase provides the annotations that GMS needs to exploit $\mathcal{K}_{GMS}$ and help the locator perform better, but annotations are unavailable in the testing phase, so there is no $\mathcal{K}_{GMS}$ and the locator performs poorly with representation capacity for $\mathcal{K}_{gt}$ only. Moreover, GMS is non-differentiable and provides no gradient during backpropagation. Thus, a better way is needed to deploy GMS and make the locator actively learn $\mathcal{K}_{GMS}$.

To transfer the exploited $\mathcal{K}_{GMS}$, we propose a Scoped Teacher, which is a teacher-student framework. Specifically, given an $\mathcal{I}_{ori}$, the student locator predicts $\mathcal{F}_{ori}$ and $\mathcal{B}_{ori}$, the confidence map and binary map. At the teacher end, GMS is adopted to regularize $\mathcal{I}_{ori}$, and the processed $\mathcal{I}_{scp}$ is fed into the teacher locator to aggregate $\mathcal{K}_{GMS}$ and $\mathcal{K}_{gt}$. The teacher output $\mathcal{B}_{scp}^{re}$ is obtained after the sub-distribution re-aggregation. Then, a consistency loss is introduced as Eq. (14) to transfer $\mathcal{K}_{GMS}$ from the Scoped Teacher to the student locator.

$$\begin{aligned}
\mathcal{L}_{consis}= {} & \frac{1}{H\cdot W}\sum_{h=1}^{H}\sum_{w=1}^{W}\Big(\left\|\mathcal{F}_{ori}(h,w)-\mathcal{B}^{re}_{scp}(h,w)\right\|^{2}+\\
& \left\|\mathcal{B}_{ori}(h,w)-\mathcal{B}^{re}_{scp}(h,w)\right\|^{1}\Big).
\end{aligned} \tag{14}$$

In consistency regularization, the Scoped Teacher adopts GMS to exploit $\mathcal{K}_{GMS}$ and stores it in the representation $\mathcal{B}_{scp}^{re}$. With Eq. (14), the consistency constraint pulls $\mathcal{F}_{ori}$ and $\mathcal{B}_{ori}$ closer to $\mathcal{B}_{scp}^{re}$. In this way, during training, the backpropagation of the consistency loss pushes $\mathcal{K}_{GMS}$ from the Scoped Teacher to the student model.

Compared with ground-truth supervision, the improvement brought by the Scoped Teacher goes beyond transferring the GMS-exploited knowledge $\mathcal{K}_{GMS}$. The Scoped Teacher shares its architecture with the student locator. Empirically, using a larger model at the teacher end lets a teacher with stronger representation capacity guide the training of a weaker student. In contrast, because our Scoped Teacher has the same architecture as the student model, the outputs of the student and teacher ends are more consistent. Therefore, the guidance of the Scoped Teacher is well suited to consistency regularization.

Finally, for parameter updating, the student crowd locator is trained via gradient descent. To aggregate knowledge and stabilize the knowledge transfer, the teacher parameters $\vartheta_{t}$ are updated via an Exponential Moving Average (EMA) of the student parameters $\vartheta_{s}$, as in Eq. (15):

$$\vartheta_{t}\leftarrow m\vartheta_{t}+(1-m)\vartheta_{s}, \tag{15}$$

where $m$ denotes the EMA decay coefficient that controls the updating rate.
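The EMA update of Eq. (15) amounts to a per-parameter convex combination, sketched below. Parameters are represented as a plain dictionary of scalars for illustration, and the decay value used in the example is an assumption, not a setting reported here.

```python
from typing import Dict


def ema_update(teacher: Dict[str, float], student: Dict[str, float], m: float) -> Dict[str, float]:
    """Eq. (15): teacher <- m * teacher + (1 - m) * student, per parameter."""
    return {name: m * teacher[name] + (1.0 - m) * student[name] for name in teacher}


# Toy example with one scalar "parameter" per model (illustrative values).
teacher_params = {"w": 1.0}
student_params = {"w": 0.0}
teacher_params = ema_update(teacher_params, student_params, m=0.99)
```

A decay $m$ close to 1 makes the teacher change slowly, which smooths out step-to-step fluctuations of the student.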

III-D Objective

Instance Segmentation Loss. The instance segmentation loss combines an L2 loss for confidence map regularization and an L1 loss for binary map regularization, as in Eq. (6).

Consistency Regularization Loss. Since the gradient is detached in the threshold learner, an L1 loss between the binary maps of the teacher and student ends is used to optimize threshold learning, as in Eq. (14).

Total Loss. During training, the student model is trained jointly in an end-to-end manner. All of its parameters are updated by summing the loss functions above:

$$\mathcal{L}_{total}=\mathcal{L}_{seg}+\mathcal{L}_{consis}. \tag{16}$$

The teacher model is updated as in Eq. (15).

III-E Inference

At the testing or inference phase, the original image $\mathcal{I}_{ori}$ is fed into the student model, which can be described as Eq. (17):

$$\mathcal{B}_{res}=f(\mathcal{I}_{ori};\vartheta_{s}). \tag{17}$$

Therefore, our proposed method does not incur any extra cost in inference.

IV Experiment

IV-A Datasets

  • Shanghai Tech Part A (SHHA): SHHA [55] contains 482 images, of which 270 are for training, 30 for validation and the rest for testing. There are 241,677 annotated instances in total.

  • Shanghai Tech Part B (SHHB): SHHB [55] consists of 716 images, of which 360 are for training, 40 for validation and the rest for testing. SHHB has 88,488 annotated instances.

  • NWPU-Crowd (NWPU): NWPU [11] is the largest dataset in the crowd analysis community so far, containing 5,109 images, of which 3,109 are for training, 500 for validation and 1,500 for testing.

  • UCF-QNRF (QNRF): QNRF [9] is a dataset with extremely congested scenarios, composed of 1,535 images, of which 961 are for training, 240 for validation and the rest for testing.

IV-B Implementation Details and Metrics

In the training phase, we adopt VGG-16 [56] and HR-Net [57] as backbone networks, a batch size of 6, the Adam optimizer [58] with learning rates of 1e-5 for the backbone and 1e-6 for the threshold encoder, a learning rate decay of 0.99 per epoch, and bilinear interpolation. In the testing phase, the test images are fed into the student locator at their original scale, and the model with the best performance on the validation set is picked for testing. Our experiments are run on two NVIDIA RTX 3090 GPUs with a total memory of 48 GB.
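As a rough illustration of the schedule described above, the sketch below assumes that the decay of 0.99 is applied multiplicatively once per epoch to both parameter groups. The function name `lr_at_epoch` and the epoch count of 100 are hypothetical.

```python
def lr_at_epoch(base_lr, epoch, decay=0.99):
    """Learning rate after `epoch` completed epochs of multiplicative decay."""
    return base_lr * decay ** epoch


# Base rates from the text; the epoch number is an assumed example value.
backbone_lr = lr_at_epoch(1e-5, epoch=100)
threshold_lr = lr_at_epoch(1e-6, epoch=100)
```

Under this assumption both rates shrink to roughly 37% of their initial values after 100 epochs, while their ratio stays fixed at 10.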

Following [43, 11], the Precision (Pre.), Recall (Rec.) and F1-measure (F1-m) are adopted as localization metrics, as in Eq. (18),

$$\begin{aligned}
Pre. &= \frac{TP}{TP+FP},\\
Rec. &= \frac{TP}{TP+FN},\\
F1\text{-}m &= \frac{2\cdot Pre\cdot Rec}{Pre+Rec},
\end{aligned} \tag{18}$$

where F1-m is the primary metric and $TP, TN, FP, FN$ denote True Positive, True Negative, False Positive and False Negative. The MAE, MSE and NAE are adopted as counting metrics, as in Eq. (19),

$$\begin{aligned}
MAE &= \frac{1}{N}\sum_{i=1}^{N}\left\|z_{i}-\widehat{z}_{i}\right\|^{1},\\
MSE &= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left\|z_{i}-\widehat{z}_{i}\right\|^{2}},\\
NAE &= \frac{1}{N}\sum_{i=1}^{N}\frac{\left\|z_{i}-\widehat{z}_{i}\right\|^{1}}{z_{i}}.
\end{aligned} \tag{19}$$
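The metrics in Eq. (18) and Eq. (19) can be computed as in the sketch below. It assumes that $N$ is the number of images, $z_{i}$ the ground-truth count and $\widehat{z}_{i}$ the predicted count of image $i$, and that every ground-truth count is positive so that NAE is defined. Function names are illustrative only.

```python
import math


def f1_measure(pre, rec):
    """Harmonic mean of precision and recall (Eq. 18); 0 if both are 0."""
    return 2 * pre * rec / (pre + rec) if pre + rec else 0.0


def localization_metrics(tp, fp, fn):
    """Precision, recall and F1-measure from matched and unmatched points."""
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return pre, rec, f1_measure(pre, rec)


def counting_metrics(gt_counts, pred_counts):
    """MAE, MSE (root of the mean squared error) and NAE as in Eq. (19).

    Assumes all ground-truth counts are positive; a zero count would make
    the NAE term undefined.
    """
    n = len(gt_counts)
    abs_err = [abs(z - z_hat) for z, z_hat in zip(gt_counts, pred_counts)]
    mae = sum(abs_err) / n
    mse = math.sqrt(sum(e * e for e in abs_err) / n)
    nae = sum(e / z for e, z in zip(abs_err, gt_counts)) / n
    return mae, mse, nae


pre, rec, f1 = localization_metrics(tp=8, fp=2, fn=2)
mae, mse, nae = counting_metrics([100, 200], [90, 220])
```

As a consistency check, `f1_measure(71.0, 63.4)` gives about 67.0, which matches the Baseline row of Tab. I.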

IV-C Analysis on Our Method

IV-C1 Ablation study

In this section, our method is decomposed into its components to examine the contribution of each. First, we apply only the proposed GMS in the inference phase to explore whether it is effective in aligning the intrinsic scale shift. Then, for comparison, the same teacher model without GMS, called Plain Teacher in Tab. I, is introduced in the training phase with the same consistency regularization loss. Finally, our whole system is deployed.

TABLE I: Ablation study tested on SHHA-val.
Method | Localization: F1-m / Pre. / Rec. (%) | Counting: MAE / MSE
Baseline | 67.0 / 71.0 / 63.4 | 119.5 / 242.1
GMS Inference | 69.2 / 74.1 / 64.8 | 85.1 / 164.8
Plain Teacher | 69.1 / 75.9 / 63.3 | 119.5 / 260.4
Whole Method | 71.4 / 73.6 / 69.3 | 81.7 / 147.1

In Tab. I, the baseline model leverages a pixel-level threshold map to binarize the confidence map into a binary map. GMS Inference directly applies GMS in the inference phase; the experiment shows that GMS improves localization performance by 2.2% in F1-m and counting performance by 34.4 in MAE. Therefore, GMS is indeed effective in aligning the intrinsic scale shift. The Plain Teacher model with ensemble learning also improves localization (F1-m from 67.0% to 69.1%), which can be attributed to knowledge aggregation, although its counting error does not decrease. With the whole system, our proposed method brings large improvements in both localization and counting.

IV-C2 Effect on knowledge transform

The proposed GMS is an off-line regularization strategy. In Section IV-C1, we demonstrate that directly applying GMS in inference improves performance. However, applying GMS requires the ground truth of the samples and incurs additional computing overhead. Thus, we propose the Scoped Teacher to transfer the knowledge exploited by GMS. To see whether this knowledge is transferred to the student locator, we conduct the experiments reported in Tab. II.

TABLE II: Demonstration on Knowledge Transform.
Method | Localization: F1-m / Pre. / Rec. (%) | Counting: MAE / MSE
Base | 67.0 / 71.0 / 63.4 | 119.5 / 242.1
Base + GMS | 69.2 / 74.1 / 64.8 | 85.1 / 164.8
Improvement | +2.2 / +3.1 / +1.4 | +34.4 / +77.3
Scoped | 71.3 / 74.3 / 68.6 | 86.4 / 163.8
Scoped + GMS | 71.6 / 74.9 / 68.4 | 79.5 / 152.5
Improvement | +0.3 / +0.6 / -0.2 | +6.9 / +11.3

In Tab. II, Base denotes the baseline crowd locator, and GMS denotes applying GMS to align the testing data online. Scoped means the model is trained via our whole Scoped Teacher. Specifically, we apply GMS at testing time to locators with and without transferred knowledge. The results show that GMS regularization clearly improves the localization and counting performance of the baseline model (+2.2 in F1-m and a reduction of 34.4 in MAE). However, GMS has only a slight influence on the locator trained with the Scoped Teacher (+0.3 in F1-m). This demonstrates that the locator trained with the Scoped Teacher has indeed learned the knowledge exploited by GMS, so additionally deploying GMS at testing time extracts little further knowledge.

IV-C3 Choice of prior optimal scale

In the GMS implementation, an optimal scale is introduced to compute an optimal interpolation factor for each sub-distribution. Intuitively, the optimal scale should be as large as possible. However, a large scale leads to a large resolution, which is computationally expensive in the convolution process. Moreover, image interpolation with a huge factor incurs serious non-semantic distortion. Therefore, several candidate scales are evaluated to determine the final optimal scale.

TABLE III: Choice on different optimal scales.
Optimal Scale | Localization: F1-m / Pre. / Rec. (%) | Counting: MAE / MSE
100 | 67.9 / 75.2 / 61.9 | 111.9 / 227.6
250 | 69.2 / 74.1 / 64.8 | 85.1 / 164.8
500 | 68.9 / 72.7 / 65.6 | 76.9 / 156.4
1,000 | 67.6 / 68.9 / 66.3 | 80.3 / 148.3
5,000 | 62.1 / 57.4 / 67.7 | 163.3 / 211.8

As the optimal scale grows in Tab. III, the performance does not improve monotonically. We find that a moderate scale is best. For smaller scales, the tiny instances are insufficiently zoomed: the locator tends to pay more attention to easy scales but ignores tiny instances, which is reflected in high precision, low recall and poor counting performance. For huge scales, we argue that interpolation yields extreme distortion, which shows up as over-estimation; thus, the Precision is low but the Recall is high. Finally, scales of 250 and 500 are comparable, and an optimal scale of 250 is chosen because a larger scale incurs higher computational complexity.

IV-C4 Comparison on three sub-distribution processing strategies

In this section, we compare three kinds of interpolation methods during the inference phase to demonstrate the effect of the proposed Re-Aggregation. First, the image is divided into patches as decoupled by GMS, and the patches are fed into the crowd locator successively; the results are listed as Patch Divide in Tab. IV. Second, based on Patch Divide, the patches are spliced into a hierarchically arranged image, whose results are listed as Patch Whole. Finally, our proposed Re-Aggregation is shown.

TABLE IV: Comparison among three kinds of sub-distribution processes.
Method | Localization: F1-m / Pre. / Rec. (%) | Counting: MAE / MSE
Baseline | 67.0 / 71.0 / 63.4 | 119.5 / 242.1
Patch Divide | 63.5 / 64.4 / 62.7 | 87.0 / 130.4
Patch Whole | 68.2 / 68.8 / 67.7 | 76.3 / 118.27
Re-Aggregation | 69.2 / 74.1 / 64.8 | 85.1 / 164.8

According to the results, our Re-Aggregation performs best on F1-m but not on counting. Therefore, we analyze the binary maps produced by the three methods. We notice that in the marginal regions, the semantic information of instances is distorted. Specifically, heads lying on a boundary line are divided into two parts, so an additional prediction is generated. The predicted counts are therefore higher, which brings them closer to the ground truth count. Moreover, when the division is imbalanced, the patch containing the larger part of a head cannot represent its true position, which causes the corresponding prediction to be recognized as a False Positive. Therefore, the Precision of Patch Divide and Patch Whole is even lower than that of the Baseline model. In summary, our Re-Aggregation indeed alleviates the semantic distortion in the marginal regions.

TABLE V: The leaderboard of NWPU-Crowd Localization (test set).
Method | Backbone | Overall: F1-m / Pre. / Rec. (%) | Overall: MAE / MSE / NAE | Scale Level: Avg. | Scale Level: A0~A5
Tiny Faces | ResNet-101 | 56.7 / 52.9 / 61.1 | 272.4 / 764.9 / 0.750 | 59.8 | 4.2 / 22.6 / 59.1 / 90.0 / 93.1 / 89.6
RAZ_Loc | VGG-16 | 59.8 / 66.6 / 54.3 | 151.5 / 634.7 / 0.305 | 42.4 | 5.1 / 28.2 / 52.0 / 79.7 / 64.3 / 25.1
VGG+GPR | VGG-16 | 52.5 / 55.8 / 49.6 | 127.3 / 439.9 / 0.410 | 37.4 | 3.1 / 27.2 / 49.1 / 68.7 / 49.8 / 26.3
IIM | VGG-16 | 73.2 / 77.9 / 69.2 | 96.1 / 414.4 / 0.235 | 58.7 | 10.1 / 44.1 / 70.7 / 82.4 / 83.0 / 61.4
TopoCount | VGG-16 | 69.2 / 68.3 / 70.1 | 107.8 / 438.5 / - | 63.3 | 5.7 / 39.1 / 72.2 / 85.7 / 87.3 / 89.7
AutoScale | VGG-16 | 62.0 / 67.4 / 57.4 | 123.9 / 515.5 / 0.304 | 48.4 | 4.1 / 29.7 / 57.2 / 76.1 / 78.7 / 44.6
Ours | VGG-16 | 74.3 / 80.8 / 68.7 | 102.9 / 446.8 / 0.245 | 60.3 | 10.7 / 42.6 / 69.8 / 83.3 / 86.2 / 69.0
Crowd-SDNet | ResNet-50 | 63.7 / 65.1 / 62.4 | - / - / - | 55.1 | 7.3 / 43.7 / 62.4 / 75.7 / 71.2 / 70.2
FIDTM | HR-Net | 75.5 / 79.8 / 71.7 | 86.0 / 312.5 / 0.277 | 47.5 | 22.8 / 66.8 / 76.0 / 72.0 / 37.4 / 10.3
IIM | HR-Net | 76.2 / 81.3 / 71.7 | 87.1 / 406.2 / 0.152 | 61.3 | 12.0 / 46.0 / 73.2 / 85.5 / 86.7 / 64.3
DCST | Swin-ViT | 77.5 / 82.2 / 73.4 | 84.2 / 374.6 / 0.153 | 60.9 | 14.5 / 51.0 / 75.3 / 85.0 / 81.7 / 57.8
Ours | HR-Net | 78.1 / 79.8 / 76.5 | 84.7 / 361.5 / 0.232 | 66.7 | 17.1 / 54.1 / 78.0 / 88.0 / 90.6 / 72.3
Figure 4: Pearson correlation coefficient on scale with two directions on four adopted datasets.

IV-C5 Why did only vertical features work

In crowd scenes, the scale distribution tends to be correlated with the spatial distribution. This is caused by the imaging process: instances that are adjacent in physical space have similar scales in the image. In our setting, the adaptation of the scale distribution further introduces spatial features to facilitate scale alignment via image interpolation. However, introducing spatial features from both the vertical and horizontal directions incurs a computational complexity of $\mathcal{O}(n_{v}\cdot n_{h})$, where $n_{v}$ and $n_{h}$ denote the numbers of sub-distributions in the vertical and horizontal directions. To save training cost, we analyze how important each direction is for scale representation. To this end, we introduce the Pearson correlation coefficient to measure how strongly scale is correlated with each of the two spatial features.

In Fig. 4, the correlation coefficients between scale and the vertical and horizontal features show that the horizontal feature is almost independent of scale. In adaptation, we aim to use spatial features to represent scale. Thus, the horizontal feature contributes little to our objective.
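The analysis above can be reproduced in spirit with a direct implementation of the Pearson correlation coefficient. The toy annotations below are hypothetical, built so that scale grows with the vertical coordinate and is unrelated to the horizontal one; they are not taken from any of the four datasets.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical head annotations: horizontal position, vertical position, scale.
horizontal = [1.0, 4.0, 4.0, 1.0]
vertical = [10.0, 20.0, 30.0, 40.0]
scale = [5.0, 10.0, 15.0, 20.0]

r_vertical = pearson(vertical, scale)
r_horizontal = pearson(horizontal, scale)
```

In this toy case the vertical coefficient is close to 1 and the horizontal one is 0, which is the pattern that would justify modelling sub-distributions along the vertical direction only.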

IV-D Comparison with State-of-the-Art Methods

In this section, four chosen datasets are grouped into three parts. NWPU and JHU are comprehensive datasets that include both congested and sparse scenarios. QNRF and SHHA are congested datasets, while SHHB is a sparse dataset.

IV-D1 Comparison with SOTA methods on comprehensive datasets

In this section, we compare our proposed method with SOTA methods on NWPU-Crowd.

Tab. V lists the comprehensive Localization and Counting results on NWPU. In Tab. V, the chosen methods are grouped by backbone network for a fair comparison. The Scale Level metric is the Recall value. A0~A5 denote instance scales in $[10^{0},10^{1}]$, $(10^{1},10^{2}]$, $(10^{2},10^{3}]$, $(10^{3},10^{4}]$, $(10^{4},10^{5}]$ and $(10^{5},+\infty)$, respectively. The bold text denotes the first place and the underlined text denotes the second place. The compared methods are TinyFaces [15], RAZ_Loc [10], VGG+GPR [59, 60], IIM [43], TopoCount [41], AutoScale [28], Crowd-SDNet [61] and DCST [44]. Additionally, TinyFaces and Crowd-SDNet utilize [62] as the backbone network. Comparing the primary metrics (F1-m and MAE), our proposed method achieves the first place on Localization performance (an F1-m of 78.1%). Furthermore, the Recall values on different scales are also listed. Tab. V shows that our work takes the first or second place on almost all scales.
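The scale-level recall reported in Tab. V can be sketched as follows, assuming six decade-wide levels A0 to A5 and a per-instance scale value of at least $10^{0}$. The function names and the toy `(scale, matched)` pairs are hypothetical; `matched` marks whether a ground-truth head was recovered by a prediction.

```python
def scale_level(scale):
    """Map an instance scale to its level index 0..5 (A0..A5)."""
    if scale < 1:
        raise ValueError("scale must be at least 10^0")
    for level, upper in enumerate((1e1, 1e2, 1e3, 1e4, 1e5)):
        if scale <= upper:
            return level
    return 5  # everything above 10^5 falls into A5


def per_level_recall(instances):
    """Recall per level from (scale, matched) pairs of ground-truth heads.

    Levels without any instance yield None instead of a recall value.
    """
    hits = [0] * 6
    totals = [0] * 6
    for scale, matched in instances:
        level = scale_level(scale)
        totals[level] += 1
        hits[level] += 1 if matched else 0
    return [h / t if t else None for h, t in zip(hits, totals)]


toy = [(5, False), (50, True), (50, False), (2000, True), (3e5, True)]
recalls = per_level_recall(toy)
```

The Avg. column of Tab. V appears to be the mean of the six per-level recalls: for the VGG-16 version of our method, (10.7 + 42.6 + 69.8 + 83.3 + 86.2 + 69.0) / 6 is about 60.3.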

Moreover, we depict the localization results visually. Following [11], we pick three representative methods to compare with ours. Concretely, TinyFaces [15] is an object detection based crowd locator, FIDTM [63] is a density regression based crowd locator, and IIM [43] is an instance segmentation based crowd locator. Fig. 5 illustrates four groups of typical samples, in which the $3114^{th}$ is a low resolution scene, the $3277^{th}$ is a sparse scene, the $3348^{th}$ is a negative scene and the $3375^{th}$ is an extremely congested scene. Firstly, for the low resolution scene, namely the region at the top of the $3114^{th}$ sample, our Scoped Teacher performs better than the others, due to the zoom strategy applied to tiny scales. Secondly, sparse scenes like the $3277^{th}$ empirically tend to suffer from more serious scale shift; thus, we surpass the others by a non-trivial margin. Thirdly, our scale alignment does not break robustness on negative scenes, i.e., the $3348^{th}$. Lastly, in extremely congested scenes, the density regression based FIDTM performs best. For instance segmentation based locators, congested scenarios incur tremendous overlapping, so the performance is relatively poor.

Figure 5: Qualitative results on the NWPU-Crowd validation set. The predicted TP, FN and FP are respectively denoted as green, red and magenta. The results on the top of each sample have a template of Pre./Rec.

IV-D2 Comparison with SOTA methods on Congested Datasets

In this section, we compare our proposed method with five other state-of-the-art crowd locators on two congested datasets (QNRF and SHHA). The compared locators are TinyFaces, RAZ_Loc, LSC-CNN, IIM and DCST. Specifically, TinyFaces is trained via the official project with default parameters. The results of RAZ_Loc are adopted from [11]. LSC-CNN and IIM also come from their official implementations. The performance of DCST is taken from its arXiv preprint. The results (Localization: F1-m, Precision, Recall; Counting: MAE and MSE) are listed in Tab. VI. On SHHA, our proposed method achieves first place on F1-m and second place on MAE. Compared with the instance segmentation crowd locators IIM and DCST, we outperform them (76.0% vs. 73.9% and 74.5%) with only VGG-16 as the backbone network. On QNRF, our work achieves first place on localization and counting. Notably, the VGG-16 version of our work surpasses the Swin-Transformer [64] based DCST (72.6% vs. 72.4%).

TABLE VI: Comparison with SOTA methods on congested datasets.
Method | Backbone | QNRF: F1-m / Pre. / Rec. (%) | QNRF: MAE / MSE | SHHA: F1-m / Pre. / Rec. (%) | SHHA: MAE / MSE
TinyFaces | ResNet-101 | 49.4 / 36.3 / 77.3 | 336.8 / 741.6 | 57.3 / 43.1 / 85.5 | 237.8 / 422.8
RAZ_Loc | VGG-16 | 53.3 / 59.4 / 48.3 | 118.0 / 198.0 | 69.2 / 61.3 / 79.5 | 71.6 / 120.1
LSC-CNN | VGG-16 | 58.2 / 58.6 / 57.7 | 120.5 / 218.3 | 68.0 / 79.6 / 66.5 | 66.4 / 117.0
IIM | VGG-16 | 68.8 / 78.2 / 61.5 | 160.6 / 290.0 | 72.5 / 72.6 / 72.5 | 83.6 / 164.2
Ours | VGG-16 | 72.6 / 77.0 / 68.7 | 137.6 / 263.2 | 76.0 / 76.4 / 75.5 | 71.8 / 128.1
IIM | HR-Net | 72.0 / 79.3 / 65.9 | 142.6 / 261.1 | 73.9 / 79.8 / 68.7 | 69.3 / 138.7
DCST | Swin-ViT | 72.4 / 77.1 / 68.2 | 127.2 / 234.3 | 74.5 / 77.2 / 72.1 | 78.4 / 153.2
Ours | HR-Net | 75.5 / 77.9 / 73.4 | 104.4 / 197.4 | 78.1 / 81.7 / 74.9 | 68.8 / 138.6

IV-D3 Comparison with SOTA methods on Sparse Dataset SHHB

In this section, we list the results on the sparse dataset SHHB. Tab. VII shows that we achieve the first place on F1-m (86.3%). Although we lag slightly behind on counting performance, we still obtain a certain improvement over the most related instance segmentation locator, IIM. We attribute this gap to the baseline method. In segmentation based localization, each predicted instance represents true semantic information, whereas density map regressors cannot guarantee that their response values carry true semantic information. To be specific, a locator with high counting performance but low localization performance cannot be regarded as a good pedestrian learner. Moreover, there is a trade-off: with higher localization precision, more box proposals tend to be lost, which worsens counting performance.

TABLE VII: Comparison with SOTA methods on sparse dataset.
Method | Backbone | SHHB: F1-m / Pre. / Rec. (%) | SHHB: MAE / MSE
TinyFaces | ResNet-101 | 71.1 / 64.7 / 79.0 | - / -
RAZ_Loc | VGG-16 | 68.0 / 60.0 / 78.3 | 9.9 / 15.6
LSC-CNN | VGG-16 | 71.2 / 71.7 / 70.6 | 8.1 / 12.7
IIM | VGG-16 | 80.2 / 84.9 / 76.0 | 22.1 / 44.4
Ours | VGG-16 | 83.8 / 89.4 / 78.0 | 18.2 / 37.8
IIM | HR-Net | 86.2 / 90.7 / 82.1 | 13.5 / 28.1
DCST | Swin-ViT | 86.0 / 88.8 / 83.3 | 11.0 / 23.6
Ours | HR-Net | 86.3 / 91.9 / 81.2 | 16.0 / 33.5

IV-E Discussion

In this section, we discuss how GMS and the Scoped Teacher improve the final performance, based on the experiments in Section IV. To keep the discussion clear, we pick one typical sample from SHHA and visualize its predicted confidence maps, threshold maps and binary maps from the baseline model, the GMS inferred model and the Scoped Teacher transferred model, as shown in Fig. 6.

Figure 6: Visualization on typical sample from SHHA. The depicted information includes the confidence maps, threshold maps and binary maps predicted by baseline model (Base), GMS inference (GMS) and Scoped Teacher trained model (Scope).

Beginning with Tab. I, we notice that directly applying GMS at testing time brings improvement, which demonstrates the effect of scale alignment. However, it is common sense that conventional image interpolation does not provide any additional information. We therefore argue that the improvement from GMS comes from distribution regularization and other forms of knowledge exploitation. In the Confidence column of Fig. 6, the red box selects a region filled with tiny instances. At the bottom of the box, GMS provides higher confidence than the baseline. This is because the original representations of the instances with improved confidence still lie within the latent space of the crowd locator. Hence, this is knowledge exploited by GMS. Nevertheless, since the image in Fig. 6 has an effective resolution (the resolution provided by the dataset authors) of only 1024 × 768, the instances at the top of the red box have outlier representations, which remain outliers after alignment via GMS.

Having explained the confidence changes brought by GMS, we turn to threshold learning. In the black box in the Threshold column of Fig. 6, GMS brings an unsmooth distribution to the threshold map. Note that the regularization of GMS does not affect the model parameters. On the right and left sides of the black box, the two regions with obviously low thresholds should be negative, which we blame on the poor robustness of the baseline model. Since the corresponding area in the red box shows better confidence prediction, the abnormally low threshold area in the black box can hardly be explained by non-semantic distortion. In Scoped Teacher training, GMS actively exposes such wrong predictions to the teacher model so that they are corrected; see the black box in the Scope row of Fig. 6. Thus, this is also knowledge exploited by GMS.

Then, we discuss the effect of the Scoped Teacher. One may ask how the knowledge is transferred and what role the Scoped Teacher plays beyond being a bridge in the transfer. Again beginning with the Confidence column in Fig. 6, the red boxes select our ROI. We assert that the Scoped Teacher guides the student to build a connection between normal tiny representations and GMS aligned representations. This is because the student locator, fed with the original outliers, is driven to make predictions similar to those of the teacher, fed with the mapped embeddings, and this process implicitly maps outlier embeddings to normal embeddings. Moreover, there is another interesting phenomenon: the confidences in the red boxes are higher in the Scope column than in the GMS column. We argue that the Scoped Teacher further aggregates the knowledge. The scope of GMS is limited to the current input, whereas the Scoped Teacher learns to build the connection from all similar representations in the training set. Thus, the knowledge in confidence is the connection, and the Scoped Teacher plays the role of connector and aggregator.

Next, we analyze threshold learning. In the Threshold column of Fig. 6, the black boxes also select our ROI. We notice that the threshold from Scope is smoother than that from GMS. In consistency regularization, the negative regions output FP samples, which exposes the lack of robustness of the locator, and the corresponding loss is deployed to penalize it. Moreover, since GMS treats the distribution discretely, it makes images lose physical features. The Scoped Teacher helps the locator learn the useful information exploited by GMS and ignore the issues incurred by losing physical features. Thus, the knowledge transferred is the penalty on the lack of robustness, and the Scoped Teacher plays the role of a filter that selects useful knowledge.

Finally, we analyze the Binary column of Fig. 6. In the yellow boxes, Scope depicts fewer boxes than GMS. The recall comparison shows that more boxes are removed, while the precision comparison shows that more accurate boxes are predicted. Thus, the knowledge transferred is box refinement, and the Scoped Teacher plays the role of refiner.

V Conclusions

This paper aims to tackle an essential issue in crowd localization, the intrinsic scale shift. Specifically, we propose to regularize the chaotic scale distribution to align the scale shift. Gaussian Mixture Scope (GMS) is proposed to implement the scale distribution regularization from the perspective of distribution decoupling and alignment among sub-distributions. Moreover, GMS introduces spatial features into the regularization, which makes the alignment geometric so that it can be deployed via image interpolation. To further address the semantic distortion incurred by adaptive decoupling, we propose a novel sub-distribution re-aggregation strategy. In addition, a Scoped Teacher model with a corresponding consistency regularization is introduced to transfer knowledge from GMS processed data to the locator, providing a novel way to apply GMS so that the locator actively learns the knowledge. The proposed GMS visibly improves localization performance. The Scoped Teacher model bridges data and model, aiding the application of GMS in the training phase and improving the final localization results. Extensive experiments show that the proposed work achieves state-of-the-art performance on popular crowd localization datasets. In the future, we will study how to align the average scale shift among datasets, namely the extrinsic scale shift, in order to locate crowds in the open-set setting.

References

  • [1] Q. Song, C. Wang, Y. Wang, Y. Tai, C. Wang, J. Li, J. Wu, and J. Ma, “To choose or to fuse? scale selection for crowd counting,” in Proceedings of AAAI Conference on Artificial Intelligence, 2021, pp. 2576–2583.
  • [2] Z. Yan, Y. Yuan, W. Zuo, X. Tan, Y. Wang, S. Wen, and E. Ding, “Perspective-guided convolution networks for crowd counting,” in Proceeding of the IEEE International Conference on Computer Vision, 2019, pp. 952–961.
  • [3] J. Gao, Q. Wang, and X. Li, “Pcc net: Perspective crowd counting via spatial convolutional network,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 10, pp. 3486–3498, 2019.
  • [4] J. Cheng, H. Xiong, Z. Cao, and H. Lu, “Decoupled two-stage crowd counting and beyond,” IEEE Transactions on Image Processing, vol. 30, pp. 2862–2875, 2021.
  • [5] M. Ling and X. Geng, “Indoor crowd counting by mixture of gaussians label distribution learning,” IEEE Transactions on Image Processing, vol. 28, no. 11, pp. 5691–5701, 2019.
  • [6] J. Wan, N. S. Kumar, and A. B. Chan, “Fine-grained crowd counting,” IEEE Transactions on Image Processing, vol. 30, pp. 2114–2126, 2021.
  • [7] W. Liu, M. Salzmann, and P. Fua, “Counting people by estimating people flows,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2021.
  • [8] Q. Wang, J. Wang, J. Gao, Y. Yuan, and X. Li, “Counting like human: Anthropoid crowd counting on modeling the similarity of objects,” arXiv preprint arXiv:2212.02248, 2022.
  • [9] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah, “Composition loss for counting, density map estimation and localization in dense crowds,” in Proceeding of the European Conference on Computer Vision, 2018, pp. 532–546.
  • [10] C. Liu, X. Weng, and Y. Mu, “Recurrent attentive zooming for joint crowd counting and precise localization,” in Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1217–1226.
  • [11] Q. Wang, J. Gao, W. Lin, and X. Li, “Nwpu-crowd: A large-scale benchmark for crowd counting and localization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 6, pp. 2141–2149, 2020.
  • [12] J. Wan, Z. Liu, and A. B. Chan, “A generalized loss function for crowd counting and localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1974–1983.
  • [13] W. Ren, X. Wang, J. Tian, Y. Tang, and A. B. Chan, “Tracking-by-counting: Using network flows on crowd density maps for tracking multiple targets,” IEEE Transactions on Image Processing, vol. 30, pp. 1439–1452, 2020.
  • [14] R. Sanford, S. Gorji, L. G. Hafemann, B. Pourbabaee, and M. Javan, “Group activity detection from trajectory and video data in soccer,” in Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 898–899.
  • [15] Y. Bai, Y. Zhang, M. Ding, and B. Ghanem, “Finding tiny faces in the wild with generative adversarial network,” in Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 21–30.
  • [16] Z. Ma, X. Hong, X. Wei, Y. Qiu, and Y. Gong, “Towards a universal model for cross-dataset crowd counting,” in Proceeding of the IEEE International Conference on Computer Vision, 2021, pp. 3205–3214.
  • [17] L. Liu, Z. Qiu, G. Li, S. Liu, W. Ouyang, and L. Lin, “Crowd counting with deep structured scale integration network,” in Proceeding of the IEEE International Conference on Computer Vision, 2019, pp. 1774–1783.
  • [18] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin, “Crowd counting using deep recurrent spatial-aware network,” in Proceedings of the International Joint Conference on Artificial Intelligence, 2018, pp. 849–855.
  • [19] Z. Ma, X. Wei, X. Hong, and Y. Gong, “Learning scales from points: A scale-aware probabilistic model for crowd counting,” in Proceedings of the ACM International Conference on Multimedia, 2020, pp. 220–228.
  • [20] Z.-Q. Cheng, J.-X. Li, Q. Dai, X. Wu, J.-Y. He, and A. G. Hauptmann, “Improving the learning of multi-column convolutional neural network for crowd counting,” in Proceedings of the ACM International Conference on Multimedia, 2019, pp. 1897–1906.
  • [21] D. Onoro-Rubio and R. J. López-Sastre, “Towards perspective-free object counting with deep learning,” in Proceeding of the European Conference on Computer Vision, 2016, pp. 615–629.
  • [22] S. Bai, Z. He, Y. Qiao, H. Hu, W. Wu, and J. Yan, “Adaptive dilated network with self-correction supervision for counting,” in Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 4594–4603.
  • [23] Y. Liu, Q. Wen, H. Chen, W. Liu, J. Qin, G. Han, and S. He, “Crowd counting via cross-stage refinement networks,” IEEE Transactions on Image Processing, vol. 29, pp. 6800–6812, 2020.
  • [24] M. Shi, Z. Yang, C. Xu, and Q. Chen, “Revisiting perspective information for efficient crowd counting,” in Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7279–7288.
  • [25] C. Zhang, H. Li, X. Wang, and X. Yang, “Cross-scene crowd counting via deep convolutional neural networks,” in Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 833–841.
  • [26] Y. Yang, G. Li, D. Du, Q. Huang, and N. Sebe, “Embedding perspective analysis into multi-column convolutional neural network for crowd counting,” IEEE Transactions on Image Processing, vol. 30, pp. 1395–1407, 2020.
  • [27] W. Liu, K. Lis, M. Salzmann, and P. Fua, “Geometric and physical constraints for drone-based head plane crowd density estimation,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 244–249.
  • [28] C. Xu, D. Liang, Y. Xu, S. Bai, W. Zhan, X. Bai, and M. Tomizuka, “Autoscale: Learning to scale for crowd counting,” International Journal of Computer Vision, pp. 1–30, 2022.
  • [29] C. Xu, K. Qiu, J. Fu, S. Bai, Y. Xu, and X. Bai, “Learn to scale: Generating multipolar normalized density maps for crowd counting,” in Proceeding of the IEEE International Conference on Computer Vision, 2019, pp. 8382–8390.
  • [30] U. Sajid and G. Wang, “Plug-and-play rescaling based crowd counting in static images,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 2287–2296.
  • [31] U. Sajid, H. Sajid, H. Wang, and G. Wang, “Zoomcount: A zooming mechanism for crowd counting in static images,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 10, pp. 3499–3512, 2020.
  • [32] D. Babu Sam, S. Surya, and R. Venkatesh Babu, “Switching convolutional neural network for crowd counting,” in Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5744–5752.
  • [33] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
  • [34] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.
  • [35] R. Stewart, M. Andriluka, and A. Y. Ng, “End-to-end people detection in crowded scenes,” in Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2325–2333.
  • [36] P. Hu and D. Ramanan, “Finding tiny faces,” in Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 951–959.
  • [37] Z. Li, X. Tang, J. Han, J. Liu, and R. He, “Pyramidbox++: High performance detector for finding tiny face,” arXiv preprint arXiv:1904.00386, 2019.
  • [38] J. Li, Y. Wang, C. Wang, Y. Tai, J. Qian, J. Yang, C. Wang, J. Li, and F. Huang, “Dsfd: Dual shot face detector,” in Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5060–5069.
  • [39] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceeding of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
  • [40] D. Lian, J. Li, J. Zheng, W. Luo, and S. Gao, “Density map regression guided detection network for rgb-d crowd counting and localization,” in Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1821–1830.
  • [41] S. Abousamra, M. Hoai, D. Samaras, and C. Chen, “Localization in the crowd with topological constraints,” in Proceedings of AAAI Conference on Artificial Intelligence, 2021.
  • [42] C. Arteta, V. Lempitsky, and A. Zisserman, “Counting in the wild,” in Proceeding of the European Conference on Computer Vision, 2016, pp. 483–498.
  • [43] J. Gao, T. Han, Y. Yuan, and Q. Wang, “Learning independent instance maps for crowd localization,” arXiv preprint arXiv:2012.04164, 2021.
  • [44] J. Gao, M. Gong, and X. Li, “Congested crowd instance localization with dilated convolutional swin transformer,” Neurocomputing, vol. 513, pp. 94–103, 2022.
  • [45] Q. Wang, T. Han, J. Gao, Y. Yuan, X. Li et al., “Ldc-net: A unified framework for localization, detection and counting in dense crowds,” arXiv preprint arXiv:2110.04727, 2021.
  • [46] G. Hinton, O. Vinyals, J. Dean et al., “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
  • [47] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le, “Self-training with noisy student improves imagenet classification,” in Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 10687–10698.
  • [48] C. Yang, L. Xie, S. Qiao, and A. L. Yuille, “Training deep neural networks in generations: A more tolerant teacher educates better students,” in Proceedings of AAAI Conference on Artificial Intelligence, 2019, pp. 5628–5635.
  • [49] C. Yuan, Z. Zhong, C. Lei, X. Zhu, and R. Hu, “Adaptive reverse graph learning for robust subspace learning,” Information Processing & Management, vol. 58, no. 6, p. 102733, 2021.
  • [50] K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D. Cubuk, A. Kurakin, and C.-L. Li, “FixMatch: Simplifying semi-supervised learning with consistency and confidence,” in Advances in Neural Information Processing Systems, 2020, pp. 596–608.
  • [51] Q. Zhou, Z. Feng, Q. Gu, J. Pang, G. Cheng, X. Lu, J. Shi, and L. Ma, “Context-aware mixup for domain adaptive semantic segmentation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 2, pp. 804–817, 2023.
  • [52] Y. Yang, S. Wang, P.-A. Heng, and L. Yu, “HCDG: A hierarchical consistency framework for domain generalization on medical image segmentation,” arXiv preprint arXiv:2109.05742, 2021.
  • [53] Y. Zhu, J. Ma, C. Yuan, and X. Zhu, “Interpretable learning based dynamic graph convolutional networks for Alzheimer’s disease analysis,” Information Fusion, vol. 77, pp. 53–61, 2022.
  • [54] N. Araslanov and S. Roth, “Self-supervised augmentation consistency for adapting semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 15 384–15 394.
  • [55] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, “Single-image crowd counting via multi-column convolutional neural network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 589–597.
  • [56] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015.
  • [57] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang et al., “Deep high-resolution representation learning for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 10, pp. 3349–3364, 2020.
  • [58] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, 2015.
  • [59] J. Gao, T. Han, Q. Wang, and Y. Yuan, “Domain-adaptive crowd counting via inter-domain features segregation and Gaussian-prior reconstruction,” IEEE Transactions on Neural Networks and Learning Systems, 2019.
  • [60] J. Gao, W. Lin, B. Zhao, D. Wang, C. Gao, and J. Wen, “C^3 framework: An open-source PyTorch code for crowd counting,” arXiv preprint arXiv:1907.02724, 2019.
  • [61] Y. Wang, J. Hou, X. Hou, and L.-P. Chau, “A self-training approach for point-supervised object detection and counting in crowds,” IEEE Transactions on Image Processing, vol. 30, pp. 2876–2887, 2021.
  • [62] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [63] D. Liang, W. Xu, Y. Zhu, and Y. Zhou, “Focal inverse distance transform maps for crowd localization,” IEEE Transactions on Multimedia, 2022.
  • [64] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin Transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 10 012–10 022.