
Open-Set Face Identification
on Few-Shot Gallery by Fine-Tuning

Hojin Park, Jaewoo Park, and Andrew Beng Jin Teoh School of Electrical and Electronics Engineering
College of Engineering, Yonsei University
Seoul, Korea
[email protected], [email protected], [email protected]
Abstract

In this paper, we address the open-set face identification problem on a few-shot gallery by fine-tuning. The problem assumes a realistic scenario for face identification, where only a small number of face images are given for enrollment and any unknown identity must be rejected during identification. We observe that both face recognition models pretrained on a large dataset and naively fine-tuned models perform poorly on this task. Motivated by this issue, we propose an effective fine-tuning scheme with classifier weight imprinting and exclusive BatchNorm layer tuning. To further improve the rejection accuracy on unknown identities, we propose a novel matcher called Neighborhood Aware Cosine (NAC) that computes similarity based on neighborhood information. We validate the effectiveness of the proposed schemes thoroughly on large-scale face benchmarks across different convolutional neural network architectures. The source code for this project is available at: https://github.com/1ho0jin1/OSFI-by-FineTuning

I Introduction

Figure 1: (a) Fully fine-tuning all parameters severely degrades the OSFI performance, while our method significantly improves the pretrained model. Detection & Identification Rate (DIR) [1] quantifies both correct identification of the known probe identities and detection of the unknown. (b) An outline of our proposed fine-tuning scheme: given a model pretrained on a large-scale face database, we initialize the gallery-set classifier by weight imprinting and then fine-tune the model on the few-shot gallery set by training only the BatchNorm layers. In the evaluation stage, a given probe is either accepted as known or rejected as unknown based on a novel similarity matcher dubbed the Neighborhood Aware Cosine (NAC) matcher.

Recently, face recognition (FR) has achieved astonishing success, attributable in large part to three factors. Deep convolutional neural network (CNN) architectures [2, 3] with strong visual priors were developed and leveraged as FR embedding models. Large-scale face datasets [4, 5] that cover massive numbers of identities with diverse ethnicities and facial variations became available. On top of these, various metric learning losses [6, 7, 8, 9] elevated the performance of deep FR models to an unprecedented level.

The majority of FR embedding models have been evaluated on numerous benchmarks with closed-set identification [7, 8, 9, 10, 11]. The closed-set identification protocol assumes that all probe identities are present in the gallery. However, in a realistic scenario, an unknown identity that is not enrolled may be encountered. Another important practical aspect is the scarcity of intra-class samples for the gallery identities to be registered: due to expensive data acquisition costs and privacy issues, only a very small number of samples might be available for each gallery identity. In this respect, open-set face identification (OSFI) with a small gallery is closer to a real scenario, as it must perform both identification of known probe identities and rejection of unknown probe identities based on the limited information in the small gallery set. Despite its practical significance, however, OSFI with a small gallery has rarely been explored.

Devising a model specific to OSFI with a small gallery is challenging in the following respects. First, an OSFI model must perform not only identification of known probe identities but also correct rejection of unknown probe identities. Hence, conventional FR embedding models devised mainly for closed-set identification can perform poorly at rejecting the unknown. In fact, as observed in Fig. 1 (a), FR embedding models pretrained on a large-scale public face database are not effective for open-set identification, leaving room for improvement. This suggests the need to adapt the pretrained model to the given gallery set.

Second, due to the few-shot nature of the small gallery set, there is a high risk of overfitting when fine-tuning the pretrained model. As shown in Fig. 1 (a), full fine-tuning (i.e., updating all parameters) of the pretrained model results in severe performance degradation. This drives us to devise an overfitting-resilient parameter tuning scheme.

Moreover, the ordinary cosine similarity matcher used in closed-set identification entails a large tradeoff between identifying known probe identities and rejecting unknown ones. As will be observed in Sec. III-D, the simple cosine matcher has a severe drawback for the task at hand. This motivates us to devise a robust matcher for OSFI.

Based on these observations, we propose an efficient fine-tuning scheme and a novel similarity-based matcher for OSFI constrained to a small gallery set. Our fine-tuning scheme consists of classifier weight initialization by weight imprinting (WI) [12] and training only the BatchNorm (BN) layers [13] for overfitting-resilient adaptation to the small gallery set. Moreover, for both effective detection of the unknown and identification of known probe identities, we propose a novel Neighborhood Aware Cosine (NAC) matcher that respects the neighborhood information of the learned gallery features and hence better calibrates the rejection score. Our contributions are summarized as follows:

  1. To effectively solve the OSFI problem constrained to a few-shot gallery set, we propose to fine-tune the pretrained face embedding model. Since full fine-tuning deteriorates the embedding quality, we search for the optimal fine-tuning method.

  2. We demonstrate that the combination of weight imprinting and exclusive BatchNorm layer fine-tuning outperforms the other baselines.

  3. We recognize that the commonly used cosine similarity is a sub-optimal matcher for rejection, and propose a novel matcher named NAC that significantly improves the rejection accuracy.

II Related Works

II-A Open Set Face Identification (OSFI)

[14], one of the earliest works on OSFI, used the proposed Open-set TCM-kNN on top of features extracted by PCA and Fisher Linear Discriminant. [15] proposed their own OSFI protocol and showed that an extreme value machine [16] trained on the gallery set performs better as a matcher than cosine similarity or linear discriminant analysis. [17] trained a matcher composed of locality-sensitive hashing [18] and partial least squares [19]. [20] applied OpenMax [21] and PROSER [22], two methods for open-set recognition of generic images, on top of extracted face features.

All previous works propose to train an open-set classifier (matcher) of some form, but all of them use a fixed encoder. To the best of our knowledge, we are the first to propose an effective fine-tuning scheme as a solution to OSFI.

II-B Cosine Similarity-based Loss Functions

[23] proposed to $l_2$-normalize the features such that the training loss is determined only by the angle between the feature and the classifier weights. [7] extended this idea by applying a multiplicative margin to the angle between a feature and its corresponding weight vector. This penalty gathers intra-class features while forcing inter-class centers (prototypes) apart. A number of follow-up papers such as [8, 9, 10, 11] modify this angular margin term in different ways, but their motivations and properties are generally similar. Therefore, in our experiments we only use the CosFace loss [8] as a representative method. For a comprehensive understanding of these loss functions, refer to [24].

III Approach

Our proposed approach is two-fold: fine-tuning on the gallery and open-set identification evaluation. In the fine-tuning stage, the classifier is initialized by weight imprinting to initiate learning from optimal discriminative features, and the model is fine-tuned by updating only the BatchNorm layers to avoid overfitting on the few-shot gallery data. In evaluation, we utilize a novel matcher NAC that computes a neighborhood aware similarity for better-calibrated rejection of the unknown. We demonstrate that the combination of these three methods significantly outperforms all other baselines.

III-A Problem Definition and Metrics

Formally, in an OSFI problem, we assume the availability of an encoder $\phi$ (an FR embedding model) pretrained on a large-scale face database that is disjoint from the evaluation set with respect to identity. The evaluation set consists of a gallery $G=\{(x^G_i,y^G_i)\}^{Cm}_{i=1}$ and a probe set $Q$. The probe set $Q$ is further divided into the known probe set $K=\{(x^K_i,y^K_i)\}$ and the unknown probe set $U=\{(x^U_i,y^U_i)\}$. $G$ and $K$ share no images $x$ but share the same identities $y\in\{1,\dots,C\}$, whereas $U$ has disjoint identities, i.e., $\mathcal{Y}^U\cap\{1,\dots,C\}=\emptyset$. $m$ denotes the number of images per identity in $G$, which we fix to 3 to satisfy the few-shot constraint. We allow the encoder to be fine-tuned on the gallery set.

The evaluation of OSFI performance uses the detection and identification rate at a given false alarm rate (DIR@FAR). FAR=1 means no probe is rejected. Note that, unlike the general case in [1], we only consider the rank-1 identification rate for DIR. Therefore, DIR@FAR=1 equals the rank-1 closed-set identification accuracy.
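For concreteness, a minimal sketch of how this metric could be computed is given below; the function and variable names are our own, not part of the original protocol code.

```python
import numpy as np

def dir_at_far(known_scores, known_correct, unknown_scores, far):
    """DIR@FAR at rank 1.
    known_scores:   acceptance scores of the known probes
    known_correct:  boolean array, True if the rank-1 identity is correct
    unknown_scores: acceptance scores of the unknown probes
    far:            allowed fraction of falsely accepted unknown probes"""
    # The threshold tau is the (1 - far)-quantile of the unknown scores,
    # so that a fraction `far` of the unknown probes exceeds it.
    tau = np.quantile(unknown_scores, 1.0 - far)
    accepted = known_scores > tau
    # A known probe counts only if it is accepted AND correctly identified.
    return float(np.mean(accepted & known_correct))
```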

III-B Classifier Initialization by Weight Imprinting

Due to the few-shot nature of the gallery set on which we fine-tune, the initialization of model parameters, and in particular of the classifier weights, is crucial to avoid overfitting. The most naive option is random initialization of the classifier weight matrix $W$. Another commonly used strategy is linear probing [25], i.e., finding an optimized weight $W$ that minimizes the classification loss over the frozen encoder embeddings $\phi(x)$.

We experimentally find that, as seen in Fig. 2, neither of these initialization schemes induces a discriminative structure for the encoder embedding $\phi(x)$. In particular, during fine-tuning, each weight vector $w_c$ in the classifier acts as a center (or prototype) for the $c$-th class (i.e., identity). Fig. 2 shows that neither random initialization nor linear probing derives optimally discriminative weight vectors $w_c$, resulting in low-quality class separation of the gallery features.

Motivated by this issue, we propose to initialize by weight imprinting (WI), which induces the optimal discriminative quality for the gallery features:

$w_c=\frac{\widehat{w_c}}{\lVert\widehat{w_c}\rVert_2},\quad\widehat{w_c}=\frac{1}{m}\sum_{y^G_i=c}\phi(x^G_i)$ (1)

where $\lVert\cdot\rVert_2$ is the $l_2$ norm, and the embedding feature $\phi(x)$ is unit-normalized such that $\lVert\phi(x)\rVert_2=1$.
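A minimal PyTorch sketch of Eq. (1), with illustrative names for the encoder and gallery loader, could look as follows:

```python
import torch
import torch.nn.functional as F

def imprint_weights(encoder, gallery_loader, num_classes, feat_dim=512):
    """Weight imprinting (Eq. 1): each classifier weight vector is the
    normalized mean of the unit-normalized gallery embeddings of its class."""
    w_hat = torch.zeros(num_classes, feat_dim)
    counts = torch.zeros(num_classes, 1)
    encoder.eval()
    with torch.no_grad():
        for x, y in gallery_loader:
            feats = F.normalize(encoder(x), dim=1)   # phi(x), unit-normalized
            w_hat.index_add_(0, y, feats)
            counts.index_add_(0, y, torch.ones(len(y), 1))
    w_hat = w_hat / counts                           # class means (1/m * sum)
    return F.normalize(w_hat, dim=1)                 # w_c = w_hat_c / ||w_hat_c||_2
```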

As expected, Fig. 2 verifies that fine-tuning from the weight imprinted initialization achieves a much higher discriminative quality. This shows the superiority of weight imprinting compared to random initialization and linear probing.

Note that weight imprinting has been frequently used in FR embedding models [8, 9]. However, the critical difference is that those models utilize weight imprinting only to prepare templates before evaluation. In our case, the WI initialization is utilized particularly for fine-tuning.

III-C BatchNorm-only Fine-Tuning

Choosing the appropriate layers to tune is another important issue for fine-tuning. Moreover, due to the extremely small number of samples for each gallery identity, there is a risk of overfitting, as suggested by the classical theory on the VC dimension [26]. In fact, a recent study [25] suggests that full fine-tuning distorts the pretrained features, including the useful convolutional filters learned from a large-scale database.

To minimize the negative effect of this deterioration, we fine-tune only the BatchNorm (BN) layers along with the classifier weight:

$\min_{W,\,\theta_{BN}}\mathcal{L}(W^T\phi_\theta(x),y),\quad\theta=[\theta_{BN},\theta_{rest}]$ (2)

where $\theta$ denotes all parameters of the encoder $\phi=\phi_\theta$, and $\theta_{BN}$ and $\theta_{rest}$ denote the BatchNorm parameters and the remaining parameters, respectively. During fine-tuning, $\theta_{rest}$ is frozen with no gradient flow. The loss function $\mathcal{L}$ can be a softmax cross-entropy or a widely used FR embedding loss such as ArcFace [9] or CosFace [8].

By selectively fine-tuning only the BN layers (and the classifier weight), the convolutional filters learned from the large-scale pretraining database are transferred intact. BN-only training is also computationally efficient, as the BN layers occupy only 0.01-0.1% of the total parameters in the CNN. Nevertheless, their model complexity is sufficient to learn general image tasks, as shown by [27].
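In PyTorch, this selective tuning amounts to disabling gradients everywhere except the BN affine parameters; a minimal sketch (assuming a standard CNN encoder with BatchNorm layers) is:

```python
import torch.nn as nn

def freeze_all_but_bn(encoder: nn.Module):
    """Freeze theta_rest and leave only theta_BN trainable, as in Eq. (2).
    The classifier weight W is kept trainable separately."""
    for p in encoder.parameters():
        p.requires_grad = False
    for m in encoder.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)) and m.affine:
            m.weight.requires_grad = True   # BN scale (gamma)
            m.bias.requires_grad = True     # BN shift (beta)
```

The optimizer is then given only the parameters with requires_grad=True, together with the imprinted classifier weight; note that the BN running statistics also adapt while the encoder is in train() mode, independently of gradient flow.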

Figure 2: The intra-class variance (left) and inter-class separation (right) of classifiers initialized by different schemes. The NormFace [23], CosFace [8], and ArcFace [9] losses are used for the linear probing initialization. The weight imprinting initialization does not require training and thus stays constant.
Figure 3: An unknown feature $u$ placed between the gallery prototypes of classes $i$ and $j$; $\epsilon$ is a small positive constant.
TABLE I: Average angle (degrees) between IJB-C probe feature vectors and their top-k closest gallery prototypes, for known (K) and unknown (U) probes. The last column reports the average over the top-2 to top-16 prototypes.

Encoder | Probe | top-1 | top-2 | top-2∼16
Res50   | K     | 50.7  | 64.0  | 69.1
Res50   | U     | 63.8  | 66.0  | 69.7
VGG19   | K     | 53.4  | 66.2  | 71.4
VGG19   | U     | 65.9  | 68.2  | 72.1

III-D Neighborhood Aware Cosine Similarity

The cosine similarity function is the predominant matcher for contemporary face verification and identification. Denoting the probe feature vector as $p$ and the gallery prototypes as $\{g_j\}_{j=1}^C$, where $g_j:=\frac{1}{m}\sum_{y^G_i=j}\phi(x^G_i)$ is the mean of the unit-normalized gallery feature vectors of class $j$, identification is performed by finding the class index with the maximum similarity, $c=\arg\max_{j=1,\dots,C}\cos(p,g_j)$. In the extension to OSFI, the decision to accept a probe as known or reject it as unknown is formulated as:

$\max_{j=1,\dots,C}\cos(p,g_j)\ \underset{\text{Reject}}{\overset{\text{Accept}}{\gtrless}}\ \tau$ (3)

where $\cos(p,q)=\frac{p}{\lVert p\rVert_2}\cdot\frac{q}{\lVert q\rVert_2}$ is the cosine similarity between two feature vectors and $\tau$ is the rejection threshold.

Now, consider the example illustrated in Fig. 3. The cosine matcher will assign the probe $u$ to the identity $i$ with the acceptance score $0.866$, which is fairly close to the maximum score of $1$. This value alone might imply that the probe is a known sample, as it is close to the gallery identity $i$. However, the probe feature vector is placed right in the middle of the identities $i$ and $j$. This in-between placement of $u$ suggests that the probe may well be unknown and should thus be assigned a lower acceptance score.

Motivated by this intuition, we propose the Neighborhood Aware Cosine (NAC) matcher that respects all top-$k$ surrounding gallery prototypes:

$\text{NAC}(p,g_i)=\frac{\exp(\cos(p,g_i))\cdot\mathbb{1}[i\in N_k]}{\sum_{j\in N_k}\exp(\cos(p,g_j))}$ (4)

Here, $N_k$ is the index set of the $k$ gallery prototypes nearest to the probe feature $p$, and $\mathbb{1}$ is the indicator function. The main goal of the NAC matcher is to improve unknown rejection. Table I shows that known probe features are much closer to their closest prototype than to their second-closest prototype, unlike unknown probes. By exploiting this phenomenon, the NAC matcher assigns a much smaller score to unknown probes, as shown in Fig. 4.
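A minimal PyTorch sketch of Eq. (4), assuming unit-normalized probe and prototype vectors (names are illustrative):

```python
import torch

def nac_scores(probe, prototypes, k=16):
    """NAC scores of one probe (d,) against C gallery prototypes (C, d).
    Entries outside the top-k neighborhood N_k are zero."""
    cos = prototypes @ probe                  # cosine similarity to every prototype
    topk_val, topk_idx = cos.topk(k)          # the k nearest prototypes form N_k
    nac = torch.zeros_like(cos)
    nac[topk_idx] = torch.softmax(topk_val, dim=0)  # exp(cos_i) / sum_{j in N_k} exp(cos_j)
    return nac
```

The open-set decision of Eq. (3) is then applied to the maximum NAC score instead of the maximum cosine similarity.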

Figure 4: The distributions of scores for known (K) and unknown (U) probes of the IJB-C dataset using cosine similarity (left) and NAC with $k=16$ (right). The scores are min-max normalized and $\tau$ is set such that FAR=0.01 in both cases. DIR=48.05% (left) vs. DIR=54.53% (right). ResNet-50 was used as the encoder.

IV Experiments

IV-A Datasets

We use the VGGFace2 [4] dataset for pretraining the encoders, and CASIA-WebFace [28] and IJB-C [29] for evaluation. Using MTCNN [30], we align and crop every image to 112×112 with identical parameters for all datasets. For VGGFace2, we remove all identities overlapping with the evaluation datasets. Each evaluation dataset is split into two groups such that the numbers of known and unknown identities are equal. We then randomly choose $m=3$ images of each known identity to create the gallery (G), and the rest become known probes (K). All images of unknown identities are unknown probes (U). Table II summarizes the statistics of the datasets. Note that every known identity is chosen to have more than 10 images so that at least 7 probe samples remain. Also note that the IJB-C dataset consists of still images and video frames (video frames typically have poorer image quality). We sample the gallery from still images and the probes from video frames, which makes this dataset much more challenging. The protocol devised here can be regarded as an extension of that in [15].
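A sketch of this split under the stated protocol (function and variable names are our own):

```python
import random

def split_evaluation_set(id_to_images, m=3, seed=0):
    """Half the identities become known, half unknown; for each known
    identity, m random images form the gallery G and the rest become
    known probes K; all images of unknown identities go to U."""
    rng = random.Random(seed)
    ids = sorted(id_to_images)
    rng.shuffle(ids)
    known, unknown = ids[: len(ids) // 2], ids[len(ids) // 2:]
    G, K, U = [], [], []
    for y in known:
        imgs = id_to_images[y][:]
        rng.shuffle(imgs)
        G += [(x, y) for x in imgs[:m]]
        K += [(x, y) for x in imgs[m:]]
    for y in unknown:
        U += [(x, y) for x in id_to_images[y]]
    return G, K, U
```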

IV-B Baselines

IV-B1 Classifier Initialization

Along with Weight Imprinting (denoted WI), we report the results of using random initialization and linear probing initialization as described in Sec. III-B.

IV-B2 Encoder Layer Fine-Tuning

Along with BatchNorm-only fine-tuning (denoted as BN), we explore tuning other layers of the encoder. The simplest option is tuning every layer (i.e., all parameters of the model), which we denote as full. The second is freezing the early layers and training only the deeper ones, which we denote as partial. We also consider the parallel residual adapter [31], which adds additional 1×1 convolutional filters alongside the original convolutional layers. During fine-tuning, only these additional filters are trained to capture the subtle differences in the new dataset. Note that the authors of [31] apply this technique to ResNet [3], hence the name parallel residual adapter, but the idea applies generally to CNNs without residual connections, so we also apply it to a VGG-style network. We denote this as PA, for Parallel Adapter.

TABLE II: Dataset Statistics. The number inside the parentheses refers to the average number of images per identity. For evaluation datasets, Known identities consist of the gallery (G) and known probe (K), where the gallery has 3 images per identity.
Pretrain        | # IDs (images / ID)
VGGFace2        | 7,689 (354.0)

Evaluation      | Known (G + K)   | Unknown (U)
CASIA-WebFace   | 5,287 (3+20.0)  | 5,288 (16.5)
IJB-C           | 1,765 (3+15.3)  | 1,765 (13.9)
TABLE III: The total number of parameters and number of fine-tuned parameters for each encoder fine-tuning scheme. ‘+’ refers to the number of added parameters for the Parallel Adapter.
                    | # Params (million)
                    | VGG19 | Res50
Pretrained          | 32.88 | 43.58
Full fine-tuning    | 32.88 | 43.58
Partial fine-tuning |  4.72 |  4.72
Parallel Adapter    | +2.22 | +3.39
BN-only fine-tuning |  0.01 |  0.03

IV-B3 Matcher

During OSFI evaluation, the vanilla cosine similarity matcher is adopted as the baseline, denoted cos. When the NAC matcher is used, we denote it as NAC. For comparison, we also use the extreme value machine (EVM) [16] following [15], trained on the gallery set with the best parameters reported by the authors.

In summary, classifier initialization methods we consider are {Random, Linear probing, WI}, fine-tuning layer configurations are {Full, Partial, PA, BN}, and matchers are {cos, EVM, NAC}. We test the OSFI performances among different combinations of these three components. Our proposed OSFI scheme is to use WI+BN+NAC jointly.

Figure 5: The OSFI performance of cosine similarity and NAC with different values of $k$ on the IJB-C dataset, using VGGNet-19 (left) and ResNet-50 (middle) as the encoder. The square markers refer to cosine similarity, and the stars mark the optimal $k$ for the different layer fine-tuning methods. To summarize the OSFI performance in a single number, we use the area under the DIR@FAR curve (AUC, %). (Right) DIR@FAR curves of the Pretrained and BN configurations using cosine similarity and NAC ($k=16$) as the matcher; the numbers in the legend are AUC values. When $k=1$, NAC reduces to the cosine matcher.

IV-C Training Details

We choose VGG19 [2] and ResNet-50 [3] as the encoders, with feature dimension 512. We pretrain these encoders on the VGGFace2 dataset using the CosFace loss (scale $s=32$, margin $m=0.4$) until convergence.
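For reference, a minimal sketch of the CosFace loss with these settings (feature and classifier weight shapes assumed to be (B, 512) and (C, 512)):

```python
import torch
import torch.nn.functional as F

def cosface_loss(features, weights, labels, s=32.0, m=0.4):
    """CosFace [8]: subtract the margin m from the target-class cosine
    logit, scale by s, then apply softmax cross-entropy."""
    cos = F.normalize(features, dim=1) @ F.normalize(weights, dim=1).t()  # (B, C)
    margin = torch.zeros_like(cos)
    margin[torch.arange(len(labels)), labels] = m  # margin only on the target class
    return F.cross_entropy(s * (cos - margin), labels)
```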

Then we fine-tune the encoder with different classifier initialization schemes and encoder layer configurations. When using the linear probing initialization, we train the classifier until the training accuracy reaches 95%.

We follow the encoder layer fine-tuning configurations of Sec. IV-B. For partial fine-tuning, we train only the last two convolutional layers (Conv-BN-ReLU-Conv-BN-ReLU). Table III shows the number of total and updated parameters for each fine-tuning scheme.

We fix the number of epochs to 20 and the batch size to 128 for every method. We again use the CosFace loss for consistency. For the optimizer, we use Adam [32] with cosine annealing. The initial learning rate is set to 1e-4 for full and PA, and 1e-3 for partial and BN, which we find to be optimal for each method. For data augmentation, we use random horizontal flipping and random cropping with a random scale from 0.7 to 1.0; the cropped images are resized back to the original size.
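Putting these choices together, a sketch of the fine-tuning setup (assuming `model` holds the encoder and imprinted classifier from the earlier sketches) might be:

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision import transforms

EPOCHS, BATCH_SIZE = 20, 128
LR = 1e-3  # 1e-3 for partial / BN; 1e-4 for full / PA

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = Adam(trainable, lr=LR)
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)  # stepped once per epoch

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    # random crop with scale in [0.7, 1.0], resized back to 112x112
    transforms.RandomResizedCrop(112, scale=(0.7, 1.0)),
])
```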

TABLE IV: DIR@FAR of different methods on CASIA-WebFace dataset and IJB-C dataset, using VGGNet-19 and ResNet-50 as the encoder. DIR@FAR=1 (100%) is the closed-set accuracy. The highest value in each column is marked in bold. For the first three rows the encoder is not fine-tuned and only the matchers are changed. The last row (WI+BN+NAC) is our proposed method.
Encoder | Classifier init | Fine-tuned layers | Matcher | CASIA-WebFace DIR@FAR (%): 0.1 / 1.0 / 10.0 / 100.0 | IJB-C DIR@FAR (%): 0.1 / 1.0 / 10.0 / 100.0

VGG19 | None           | None    | cos | 25.23 / 52.97 / 70.07 / 80.89 | 28.35 / 45.55 / 61.71 / 73.80
VGG19 | None           | None    | EVM | 37.57 / 57.75 / 71.03 / 80.78 | 35.03 / 53.64 / 63.34 / 73.70
VGG19 | None           | None    | NAC | 25.15 / 55.68 / 71.41 / 80.89 | 36.73 / 51.92 / 64.27 / 73.80
VGG19 | Random         | Full    | cos | 23.95 / 43.19 / 59.03 / 70.94 | 17.18 / 32.62 / 46.90 / 60.23
VGG19 | Linear probing | Full    | cos | 28.82 / 55.64 / 70.44 / 79.84 | 30.80 / 45.91 / 59.63 / 70.09
VGG19 | WI             | Full    | cos | 27.63 / 57.58 / 72.02 / 80.94 | 35.49 / 50.52 / 63.56 / 73.53
VGG19 | WI             | Partial | cos | 28.91 / 57.31 / 72.29 / 81.16 | 34.81 / 51.98 / 64.53 / 73.89
VGG19 | WI             | PA      | cos | 26.29 / 57.90 / 72.82 / 81.82 | 31.74 / 50.21 / 64.26 / 74.50
VGG19 | WI             | BN      | cos | 25.39 / 56.65 / 72.54 / 82.14 | 32.19 / 48.74 / 63.87 / 74.43
VGG19 | WI             | BN      | NAC | 25.94 / 58.01 / 72.92 / 82.14 | 38.09 / 53.08 / 65.30 / 74.43

Res50 | None           | None    | cos | 23.85 / 58.06 / 74.15 / 83.69 | 32.11 / 48.05 / 65.31 / 76.96
Res50 | None           | None    | EVM | 39.44 / 61.61 / 75.02 / 83.57 | 38.12 / 38.12 / 66.81 / 76.96
Res50 | None           | None    | NAC | 21.24 / 60.23 / 75.31 / 83.69 | 36.67 / 54.53 / 68.14 / 76.96
Res50 | Random         | Full    | cos | 25.31 / 45.43 / 60.80 / 72.44 | 14.88 / 32.05 / 49.39 / 61.88
Res50 | Linear probing | Full    | cos | 28.35 / 60.11 / 74.63 / 82.73 | 30.35 / 46.42 / 61.90 / 72.34
Res50 | WI             | Full    | cos | 26.73 / 63.92 / 77.49 / 84.65 | 39.05 / 56.00 / 67.83 / 76.94
Res50 | WI             | Partial | cos | 25.98 / 64.66 / 78.07 / 85.02 | 44.31 / 57.11 / 69.13 / 77.49
Res50 | WI             | PA      | cos | 24.89 / 63.85 / 77.58 / 85.01 | 36.69 / 54.86 / 68.30 / 77.63
Res50 | WI             | BN      | cos | 25.70 / 65.83 / 79.66 / 86.73 | 40.29 / 55.71 / 69.29 / 78.74
Res50 | WI             | BN      | NAC | 23.65 / 67.72 / 80.34 / 86.73 | 40.25 / 58.25 / 70.40 / 78.74

IV-D Optimal k for NAC

Since the gallery set is small, we cannot afford a separate validation set to optimize $k$ individually for each dataset. Instead, we attempt to find a global value that performs optimally regardless of the fine-tuning method, if one exists.

We first fine-tune the encoders with the different layer configurations, which gives us five different encoders, including one without any fine-tuning: pretrained, full, partial, PA, and BN. We then search for the best parameter $k$ for the NAC matcher by grid search over [2, 4, 8, 16, 32, 128, 256, 512, 1024, $C$], where $C$ is the total number of identities. Note that $k=1$ amounts to using cosine similarity instead of NAC, which we add for comparison. Since a single-value objective is preferred, we use the area under the DIR@FAR curve (AUC) instead of DIR values at individual FARs. We repeat this process with different datasets and encoder architectures.
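A sketch of this search, reusing the hypothetical dir_at_far and nac_scores helpers from the earlier sketches (probe features, prototypes, and rank-1 correctness flags are assumed to be precomputed):

```python
import numpy as np

# Assumed precomputed: K_feats / U_feats (known / unknown probe feature
# tensors), protos (gallery prototypes), K_correct (rank-1 correctness
# flags for known probes), and C (number of gallery identities).
grid = [2, 4, 8, 16, 32, 128, 256, 512, 1024, C]
fars = np.linspace(0.001, 1.0, 100)  # assumed FAR grid for the AUC

def auc_of_k(k):
    ks = np.array([nac_scores(p, protos, k).max().item() for p in K_feats])
    us = np.array([nac_scores(p, protos, k).max().item() for p in U_feats])
    dirs = [dir_at_far(ks, K_correct, us, far) for far in fars]
    return np.trapz(dirs, fars)      # area under the DIR@FAR curve

best_k = max(grid, key=auc_of_k)
```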

The results are shown in Fig. 5. We do not include the results on CASIA-WebFace, as they show a similar trend. Excluding $k=1$, which is not NAC, the results show a smooth unimodal curve with a peak at $k=16$ or $32$. This shows that the NAC matcher indeed has a globally optimal $k$ that is robust across datasets, encoders, and fine-tuning methods. We therefore choose $k=16$ ($k=32$ gives similar results) as the global parameter throughout this paper.

Note that when $k=C$, NAC becomes equivalent to a softmax over cosine similarity logits. However, this is notably inferior to $k=16$, which implies that considering only the $k$ nearest prototypes is superior to considering every gallery prototype.

IV-E Comparison of Fine-Tuning Methods

We compare the OSFI performance of the pretrained (non-fine-tuned) model with six different combinations of classifier initialization schemes and layer fine-tuning configurations: random+full, linear probing+full, WI+full, WI+partial, WI+PA, and WI+BN. The matcher is fixed to cosine similarity. These correspond to rows 4-9 in Table IV.

First, to compare the different classifier initialization schemes, we fix the fine-tuning scheme to full. With random initialization, both the rejection accuracy (DIR@FAR=0.1%, 1%, 10%) and the closed-set accuracy (DIR@FAR=100%) drop severely. With linear probing, the rejection accuracy improves while the closed-set accuracy drops. Only WI clearly improves the encoder performance, supporting the superiority of weight imprinting.

Next, we fix the classifier initialization to WI and compare the different layer fine-tuning configurations. full clearly has the worst performance. While PA is better than partial in closed-set accuracy, partial clearly outperforms PA in rejection accuracy. BN outperforms all others in closed-set accuracy by a large margin but sometimes falls behind partial in rejection accuracy.

With the aid of the NAC matcher, our method WI+BN+NAC outperforms all other methods in every aspect. Compared to the pretrained baseline, it gains 4.60%, 8.11%, 4.57%, and 1.68% higher DIR on average at FAR=0.1%, 1%, 10%, and 100%, respectively.

TABLE V: Inter-class separation, intra-class variance, DBI, and the AUC gain ΔAUC from using NAC (cf. Fig. 5) for each layer fine-tuning configuration. The values are averaged across datasets and encoder architectures. (↑) means larger is better; (↓) means smaller is better.

                    | Inter (↑) | Intra (↓) | DBI (↓) | ΔAUC (↑)
Pretrained model    |   106.3   |   34.5    |  1.52   |  0.740
Full fine-tuning    |   106.7   |   24.2    |  0.87   |  0.025
Partial fine-tuning |   106.4   |   24.5    |  0.90   |  0.058
Parallel Adapter    |   107.0   |   31.8    |  1.32   |  0.135
BN-only fine-tuning |   107.3   |   33.6    |  1.46   |  0.335

IV-F Analysis on Discriminative Quality of Different Fine-tuning Methods

How do the different layer fine-tuning configurations affect the final OSFI performance? To analyze this, we adopt three metrics: inter-class separation, intra-class variance, and the Davies-Bouldin Index (DBI) [33]. The definitions of the first two metrics are identical to those in Fig. 2. DBI evaluates clustering quality, where DBI ≈ 0 means perfect clustering. We compute these metrics on the gallery features after fine-tuning; the results are shown in Table V.
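For DBI, scikit-learn provides a ready implementation; a one-line sketch on assumed inputs:

```python
from sklearn.metrics import davies_bouldin_score

# gallery_feats: (N, 512) array of fine-tuned gallery embeddings (assumed),
# labels: (N,) identity indices. Lower DBI means better-separated clusters.
dbi = davies_bouldin_score(gallery_feats, labels)
```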

Here we can clearly separate the configurations into two groups: full and partial versus PA and BN. The first group has inter-class separation similar to the pretrained model but significantly smaller intra-class variance, which leads to a small DBI. This is in stark contrast with the second group.

From this observation, we can conjecture that the two groups follow different optimization strategies. The first group was able to reduce the training loss easily by collapsing the gallery features of each class into a single direction (shown by the small angle between intra-class features). This was possible because both full and partial directly update the parameters of the convolutional filters. On the other hand, all convolutional filters are frozen for both PA and BN. This constraint may have prevented these methods from taking the shortcut of simply collapsing the gallery features, instead leading them to separate the embeddings of different identities. This explains why PA and BN achieve higher closed-set accuracy.

This can also explain the AUC gain (ΔAUC) from using NAC instead of cosine similarity. When features collapse, they become redundant, and so do the prototypes. The information from neighboring prototypes then becomes less helpful in rejecting unknown samples, leading to only a marginal gain from NAC. This is why full and partial do not benefit from the NAC matcher.

Figure 6: The performance of our method against the baseline with respect to different gallery sizes. The AUC of the DIR@FAR curve is used as the performance measure.

IV-G Performance with respect to Different Gallery Size

Fig. 6 shows the OSFI performance of our method against the baseline (pretrained encoder with the cos matcher) with respect to different gallery sizes. Our method consistently improves upon the baseline, except in the extreme case where only one image is provided per identity.

V Conclusion and Future Works

In this work, we showed that combining a weight-imprinted classifier with BatchNorm-only tuning of the encoder effectively improves the encoder's OSFI performance without suffering from overfitting. We further improved performance with our novel NAC matcher in place of the commonly used cosine similarity. Future work will explore extending this idea to open-set few-shot recognition of generic images.
Acknowledgements:
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. NRF-2022R1A2C1010710).

References

  • [1] A. K. Jain and S. Z. Li, Handbook of face recognition.   Springer, 2011, vol. 1.
  • [2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [4] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “Vggface2: A dataset for recognising faces across pose and age,” in 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018).   IEEE, 2018, pp. 67–74.
  • [5] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms-celeb-1m: A dataset and benchmark for large-scale face recognition,” in European conference on computer vision.   Springer, 2016, pp. 87–102.
  • [6] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
  • [7] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere embedding for face recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 212–220.
  • [8] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, “Cosface: Large margin cosine loss for deep face recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5265–5274.
  • [9] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4690–4699.
  • [10] X. Wang, S. Zhang, S. Wang, T. Fu, H. Shi, and T. Mei, “Mis-classified vector guided softmax loss for face recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 12241–12248.
  • [11] Q. Meng, S. Zhao, Z. Huang, and F. Zhou, “Magface: A universal representation for face recognition and quality assessment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14225–14234.
  • [12] H. Qi, M. Brown, and D. G. Lowe, “Low-shot learning with imprinted weights,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5822–5830.
  • [13] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning.   PMLR, 2015, pp. 448–456.
  • [14] F. Li and H. Wechsler, “Open set face recognition using transduction,” IEEE transactions on pattern analysis and machine intelligence, vol. 27, no. 11, pp. 1686–1697, 2005.
  • [15] M. Gunther, S. Cruz, E. M. Rudd, and T. E. Boult, “Toward open-set face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 71–80.
  • [16] E. M. Rudd, L. P. Jain, W. J. Scheirer, and T. E. Boult, “The extreme value machine,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 3, pp. 762–768, 2017.
  • [17] R. Vareto, S. Silva, F. Costa, and W. R. Schwartz, “Towards open-set face recognition using hashing functions,” in 2017 IEEE international joint conference on biometrics (IJCB).   IEEE, 2017, pp. 634–641.
  • [18] B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 6, pp. 1092–1104, 2011.
  • [19] G. Mateos-Aparicio, “Partial least squares (pls) methods: Origins, evolution, and application to social sciences,” Communications in Statistics-Theory and Methods, vol. 40, no. 13, pp. 2305–2317, 2011.
  • [20] H. Dao, D.-H. Nguyen, and M.-T. Tran, “Face recognition in the wild for secure authentication with open set approach,” in International Conference on Future Data and Security Engineering.   Springer, 2021, pp. 338–355.
  • [21] A. Bendale and T. E. Boult, “Towards open set deep networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1563–1572.
  • [22] D.-W. Zhou, H.-J. Ye, and D.-C. Zhan, “Learning placeholders for open-set recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4401–4410.
  • [23] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille, “Normface: L2 hypersphere embedding for face verification,” in Proceedings of the 25th ACM international conference on Multimedia, 2017, pp. 1041–1049.
  • [24] I. Masi, Y. Wu, T. Hassner, and P. Natarajan, “Deep face recognition: A survey,” in 2018 31st SIBGRAPI conference on graphics, patterns and images (SIBGRAPI).   IEEE, 2018, pp. 471–478.
  • [25] Anonymous, “Fine-tuning distorts pretrained features and underperforms out-of-distribution,” in Submitted to The Tenth International Conference on Learning Representations, 2022, under review. [Online]. Available: https://openreview.net/forum?id=UYneFzXSJWh
  • [26] V. N. Vapnik and A. Y. Chervonenkis, “On the uniform convergence of relative frequencies of events to their probabilities,” in Measures of complexity.   Springer, 2015, pp. 11–30.
  • [27] J. Frankle, D. J. Schwab, and A. S. Morcos, “Training batchnorm and only batchnorm: On the expressive power of random features in cnns,” arXiv preprint arXiv:2003.00152, 2020.
  • [28] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from scratch,” arXiv preprint arXiv:1411.7923, 2014.
  • [29] B. Maze, J. Adams, J. A. Duncan, N. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, J. Cheney et al., “Iarpa janus benchmark-c: Face dataset and protocol,” in 2018 International Conference on Biometrics (ICB).   IEEE, 2018, pp. 158–165.
  • [30] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
  • [31] S.-A. Rebuffi, H. Bilen, and A. Vedaldi, “Efficient parametrization of multi-domain deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [33] D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE transactions on pattern analysis and machine intelligence, no. 2, pp. 224–227, 1979.