
Improved Deep Classwise Hashing With Centers Similarity Learning for Image Retrieval

Ming Zhang Department of Electrical Engineering
City University of Hong Kong
Hong Kong, China
[email protected]
   Hong Yan Department of Electrical Engineering
City University of Hong Kong
Hong Kong, China
[email protected]
Abstract

Deep supervised hashing for image retrieval has attracted researchers’ attention due to its high efficiency and superior retrieval performance. Most existing deep supervised hashing works, which are based on pairwise/triplet labels, suffer from expensive computational cost and insufficient utilization of semantic information. Recently, deep classwise hashing introduced an alternative classwise loss supervised by class label information; however, we find it still has a drawback. In this paper, we propose an improved deep classwise hashing, which enables hashing learning and class centers learning simultaneously. Specifically, we design a two-step strategy for centers similarity learning. It interacts with the classwise loss to attract each class center toward its intra-class samples while pushing other class centers as far away as possible. The centers similarity learning contributes to generating more compact and discriminative hashing codes. Experiments on three benchmark datasets show that the proposed method effectively surpasses the original method and outperforms state-of-the-art baselines under various commonly-used evaluation metrics for image retrieval.

Index Terms:
image retrieval, deep hashing, convolutional neural networks

I Introduction

In the big data era, there is an explosive growth of data uploaded to the Internet every day. How to search high-dimensional data in a large-scale database both efficiently and effectively becomes increasingly important for various social platforms and search engines. Given a query image, the task of content-based image retrieval (CBIR) is to find all items in the database related to the query according to their visual features. CBIR has witnessed great progress in the last decades. Suppose both the database images and the query image are encoded as real-valued features; the closest items to the query are then found by ranking the database items by their distance to the query in the feature space. However, for a database with millions of images, the computation and storage costs of real-valued features are expensive. Moreover, this places a heavy burden on mobile or embedded devices, where storage is limited and low latency is required.

As a powerful approximate nearest neighbor search technique, hashing has been widely employed for information retrieval. The goal of hashing is to map high-dimensional data into lower-dimensional binary codes while preserving the original similarity relations of the input space. Performed with binary codes, retrieval enjoys extremely fast speed and low storage cost. Generally, existing data-dependent hashing methods can be categorized into unsupervised hashing [1, 2] and supervised hashing [3, 4, 5]. Unsupervised hashing methods learn hashing functions under different distance measurements without considering label information, whereas supervised hashing incorporates label information, e.g., pairwise labels, into the hashing learning process and is known to outperform unsupervised hashing at the same code length.

With the great success of deep learning in the past years, deep supervised hashing [6, 7, 8, 9, 10] applies deep neural networks to extract feature representations for hashing function learning. Among these methods, works trained in an end-to-end manner have greatly improved retrieval performance and shown superior results to traditional supervised hashing methods. Most existing deep supervised hashing utilizes pairwise label information to examine the similarity between pairs of samples. A representative work is Deep Pairwise Supervised Hashing (DPSH) [6], which minimizes the Hamming distance between each pair of similar samples while maximizing the Hamming distance between each pair of dissimilar samples. Similar samples refer to samples sharing at least one class, and their similarity is denoted as ‘1’; dissimilar samples share no common object class, and their similarity is denoted as ‘0’ or ‘-1’. Inspired by DPSH, several variations have been proposed, e.g., Deep Supervised Hashing with Triplet labels (DTSH) [11] and Deep Supervised Discrete Hashing (DSDH) [7].

The main problem of pairwise/triplet labels-based deep supervised hashing is the tremendous computation cost. The time complexity of the algorithm is $O(n^{2})$, where $n$ is the number of samples in the training set. In practice, this is handled by mini-batch sampling of inputs during training. However, it brings a side effect: the labeled data cannot be fully utilized because some supervised information is always discarded during iterative learning. Consequently, the learned binary hashing codes are suboptimal for preserving the semantic relations of the dataset. Some works have been proposed to overcome this issue, e.g., DAGH [8] utilizes an anchor graph to efficiently connect all the training samples and the anchors under the deep hashing framework.

Recently, some novel deep hashing methods propose to directly use label information as supervision. One typical example is Deep Classwise Hashing (DCWH) [10], which applies a classwise loss based on a normalized Gaussian distribution. DCWH introduces class centers into the learning procedure, with the essence of minimizing the Hamming distance between intra-class samples and the corresponding class center. Experiments show that DCWH outperforms prior pairwise/triplet labels-based deep hashing works, especially on datasets with a large number of classes and small inter-class variations. Although DCWH guarantees that the average distance of intra-class samples to the corresponding class center is smaller than that to other centers, it still has a drawback. Concretely, when data points belonging to different classes lie far from their corresponding class centers, they are prone to be closer to each other than to their intra-class samples. We will demonstrate this issue in detail in the next section. To overcome it, we propose an improved deep classwise hashing method (IDCWH) with centers similarity learning to generate more compact and discriminative binary codes for image retrieval. The contributions of this paper are summarized as follows:

  • The proposed IDCWH enables networks to learn hashing codes and class centers simultaneously in an end-to-end manner. The two parts of the designed loss mutually contribute to concentrating intra-class samples on the corresponding class center and lead to better training convergence.

  • We introduce a two-step strategy for centers similarity learning. It first estimates a binary center for each unique class from the mini-batch data to represent the intra-class samples. Then, it dynamically attracts the corresponding learnable center to be close to the estimated center while pushing other centers away in the Hamming space.

  • Extensive experiments on three benchmark datasets show that the proposed IDCWH outperforms the compared state-of-the-art baselines, verifying its effectiveness in generating compact and semantics-preserving hashing codes for image retrieval.

II Related Works

Denote the dataset with $N$ training samples as $X=\{x_{i}\}_{i=1}^{N}$. The label matrix of $X$ is given as $Y=[y_{1},y_{2},\ldots,y_{N}]\in\mathbb{R}^{C\times{N}}$, where $C$ is the number of unique classes in the dataset and $y_{ji}=1$ if $x_{i}$ owns the label of the $j$th class, otherwise $y_{ji}=0$. Note that $y_{i}$ is not limited to a one-hot encoding vector, in order to handle multi-label cases. The goal of supervised hashing is to learn hashing functions that map $X$ into binary codes $B=\{b_{i}\}_{i=1}^{N}\in\mathbb{R}^{l\times{N}}$ with the supervision information in $Y$, where $b_{i}\in\{-1,1\}^{l}$ and $l$ is the code length. Nowadays, deep supervised hashing [7, 6, 10, 8] applies deep neural networks $f(\cdot)$ to achieve feature learning and hashing learning jointly. Thus, $b_{i}=f(\Theta;x_{i})$, where $\Theta$ represents the learnable parameters of the network.

II-A Deep Supervised Hashing with Pairwise Labels

Given the label matrix $Y$, the pairwise similarity between any pair of samples $x_{i}$ and $x_{j}$ in $X$ can be represented as $s_{ij}$, where $s_{ij}=y_{i}^{T}y_{j}$. Thus, $s_{ij}=1$ if $x_{i}$ and $x_{j}$ share at least one common label, and $s_{ij}=0$ otherwise. Most deep supervised hashing is based on pairwise/triplet labels [6, 7], which learns $f(\cdot)$ to map $X$ from the original space into $B$ in the Hamming space while maintaining the semantic similarity in $s_{ij}$. The loss function $L_{p}$ preserving the pairwise similarity is summarized as follows:

\min_{\Theta}L_{p}=\sum_{\substack{i=1\\ i\neq j}}^{N}\sum_{j=1}^{N}L\left(f(\Theta;x_{i}),f(\Theta;x_{j});s_{ij}\right)\qquad(1)

The above loss function aims to minimize the Hamming distance between any pair of similar samples involving $x_{i}$ while maximizing the distance between any pair of dissimilar samples involving $x_{i}$. Owing to the expensive computation cost of (1), pairwise-based deep supervised hashing usually samples a portion of data from the whole dataset for training. The insufficient utilization of the supervised information makes the learned binary codes $B$ less discriminative for image retrieval.
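
To make this construction concrete, below is a minimal PyTorch sketch of a DPSH-style pairwise negative log-likelihood over a mini-batch; the function name and the scaling $\theta_{ij}=0.5\,h_{i}^{T}h_{j}$ are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def pairwise_nll(h, s):
    """DPSH-style pairwise loss on a mini-batch (a sketch, not the official code).

    h: (B, l) real-valued hashing outputs of the network.
    s: (B, B) similarity matrix, s[i, j] = 1 iff samples i and j share a label.
    """
    theta = 0.5 * h @ h.t()                          # theta_ij = 0.5 * h_i^T h_j
    nll = F.softplus(theta) - s * theta              # log(1 + e^theta) - s_ij * theta_ij
    mask = ~torch.eye(h.size(0), dtype=torch.bool)   # drop the i == j terms
    return nll[mask].mean()
```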

Refer to caption
Figure 1: A toy example demonstrating the drawback of the original DCWH (a) and its improvement (b). Two classes of data points are sampled from two independent Gaussian distributions with the same variance in every case. The centers of the two classes are located at (0,1) and (1,1), respectively, in (a), and at (0,1) and (1,0) in (b). The x- and y-axes represent the coordinates of the data points, while the z-axis represents the probability density.

II-B Deep Supervised Hashing with Classwise Loss

Inspired by deep metric learning [12], DCWH [10] first applies a classwise loss for hashing learning, which directly uses the label information as supervision. The classwise loss can be formulated as a cross-entropy loss based on a normalized Gaussian distribution:

\begin{split}\min_{\Theta,M}L_{1}&=-\sum_{i=1}^{N}\sum_{j=1}^{C}y_{ji}\log{p\left(y_{ji}=1|x_{i};\Theta;M\right)}\\ &=-\sum_{i=1}^{N}\sum_{j=1}^{C}y_{ji}\log\frac{\exp\left\{-\frac{\|h_{i}-\mu_{j}\|^{2}}{2\sigma^{2}}\right\}}{\sum_{c=1}^{C}\exp\left\{-\frac{\|h_{i}-\mu_{c}\|^{2}}{2\sigma^{2}}\right\}}\end{split}\qquad(2)

where $h_{i}=f(\Theta;x_{i})$ is the output of the feature learning part, $M=\{\mu_{j}\}_{j=1}^{C}$ are the class centers, and $\sigma^{2}$ is a hyperparameter controlling the class variance. Note that the scale parameter $\frac{1}{\sqrt{2\pi}\sigma}$ is omitted for clarity, since it does not affect the final result. $M$ is updated periodically as $\mu_{j}=\left(\sum_{i=1}^{N}y_{ji}h_{i}\right)/n_{j}$, where $n_{j}$ is the number of samples with the $j$th class label. By introducing class centers, (2) aims to maximize the probability of assigning intra-class samples to the corresponding class center. However, it cannot guarantee that samples belonging to different classes but lying far from the means of their respective Gaussian distributions remain closer to their intra-class samples than to each other. An illustrative example is shown in Fig. 1. We can see that increasing the Hamming distance between different class centers helps to avoid mis-retrieving data points lying on the boundary. We elaborate on how to improve this centers discriminability in the next section.
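
For reference, a minimal PyTorch sketch of the classwise loss in (2) is given below, assuming single-label integer targets (the multi-label case would instead weight the log-probabilities by $y_{ji}$); names and defaults are illustrative.

```python
import torch
import torch.nn.functional as F

def classwise_loss(h, labels, centers, sigma2=4.0):
    """Classwise loss of Eq. (2): softmax over negative squared distances to the class centers.

    h: (B, l) hashing outputs, labels: (B,) integer class ids,
    centers: (C, l) class centers M, sigma2: the class-variance hyperparameter.
    """
    d2 = torch.cdist(h, centers).pow(2)       # (B, C) squared Euclidean distances
    logits = -d2 / (2.0 * sigma2)             # unnormalized log-probabilities
    return F.cross_entropy(logits, labels)    # -log softmax probability of the true class
```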

Refer to caption
Figure 2: Illustration of IDCWH. It learns the network parameters $\Theta$ in the feature learning part and the class centers $M=\{\mu_{j}\}$ in the dual loss part. The proposed centers similarity learning first estimates binary centers $U$; the centers similarity can then be indicated by the inner product between $U$ and $M$ under the supervision of $S$.

III Improving Classwise Hashing with Centers Similarity Learning

III-A Problem Definition

To alleviate the aforementioned issue in DCWH, the proposed IDCWH introduces a novel loss function on centers similarity. For simplicity, we adopt the formulation of the original classwise loss in (2) as the base loss and regard the newly designed loss as a regularization term. Note that we now treat the class centers as learnable parameters instead of updating them manually and periodically. IDCWH provides an end-to-end framework combining efficient hashing codes and class centers learning. Specifically, we aim to learn the network parameters $\Theta$ in the hashing learning part and the class centers $M=\{\mu_{j}\}_{j=1}^{C}$ in the loss layer. The proposed centers similarity learning has two functionalities: firstly, it dynamically attracts the class centers to concentrate on intra-class samples, which interacts with the original classwise loss to minimize the sample-to-center distance. Secondly, it maximizes the distance between intra-class samples and the centers belonging to other classes for discriminative retrieval. The overview of IDCWH is shown in Fig. 2.

III-B Two-step Centers Similarity Learning

III-B1 Intra-class Samples Clustering

One straightforward way to improve the discriminability of the learnable centers is to enlarge the Hamming distance between pairwise class centers. However, this may cause a problem: the learned class centers may fail to follow the distribution of the hashing outputs. Consequently, the training procedure may become unstable and hard to converge. To tackle this issue, we instead first estimate a center with a binary constraint, which clusters the intra-class data points. Denoting the approximated center of the $j$th class as $u_{j}$, we have the following objective function:

\min_{u_{j}}\sum_{i=1}^{N}\sum_{j=1}^{C}y_{ji}{\left\|u_{j}-b_{i}\right\|}_{2}^{2}\quad\mathrm{s.t.}\;u_{j}\in\{-1,1\}^{l}\qquad(3)

(3) can be reduced to the following formulation, which clusters intra-class samples to each approximated center individually:

\min_{u_{j}}\sum_{k=1}^{n_{j}}{\left\|u_{j}-b_{jk}\right\|}_{2}^{2}=\sum_{k=1}^{n_{j}}u_{j}^{T}u_{j}+\sum_{k=1}^{n_{j}}b_{jk}^{T}b_{jk}-2u_{j}^{T}\sum_{k=1}^{n_{j}}b_{jk}\qquad(4)

where $n_{j}$ is the number of samples belonging to the $j$th class and $b_{jk}$ represents the $k$th binary code with class label $j$. Since both $u_{j}^{T}u_{j}$ and $b_{jk}^{T}b_{jk}$ are constants under the binary constraint, (4) can be simplified as:

\max_{u_{j}}u_{j}^{T}\sum_{k=1}^{n_{j}}b_{jk}\quad\mathrm{s.t.}\;u_{j}\in\{-1,1\}^{l}\qquad(5)

Denoting the sum of all the intra-class binary codes $\sum_{k=1}^{n_{j}}b_{jk}$ as $m_{j}$, the solution of (5) has the closed form:

u_{j,v}=\mathbf{sgn}(m_{j,v})=\left\{\begin{array}{cc}1,&{m_{j,v}\geq 0}\\ -1,&{m_{j,v}<0}\end{array}\right.\qquad(6)

where $\mathbf{sgn}(\cdot)$ is the binarization function and $v=1,2,\ldots,l$ indexes the vector entries. (6) amounts to voting for every bit of the binary center among the samples of the same class.
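
A possible PyTorch realization of this bit-wise voting is sketched below; the function name is illustrative and ties ($m_{j,v}=0$) are mapped to +1 following the convention in (6).

```python
import torch

def estimate_binary_center(b):
    """Estimate the binary center of one class by bit-wise voting, Eq. (6).

    b: (n_j, l) binarized codes in {-1, +1} of the intra-class samples.
    Returns an (l,) vector in {-1, +1}.
    """
    m = b.sum(dim=0)                                  # m_j: bit-wise sum over the class
    return torch.where(m >= 0, torch.ones_like(m), -torch.ones_like(m))
```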

III-B2 Centers Concentration and Repelling

After obtaining $u_{j}$ for the intra-class samples, the second step of the proposed method aims to minimize the Hamming distance from $u_{j}$ to the learnable class center $\mu_{j}$ while maximizing the distance from $u_{j}$ to the other learnable class centers. Since the training process runs in a mini-batch manner, we denote the estimated binary centers of all the unique labels appearing in the mini-batch data as $U=\{u_{i}\}_{i=1}^{Z}$. Then, we can build a similarity matrix $S\in\mathbb{R}^{Z\times{C}}$ to encode the similarity relations between $U$ and $M$. $S$ is defined as follows:

s_{ij}=\left\{\begin{array}{cc}1,&{y_{u_{i}}^{T}y_{\mu_{j}}=1}\\ 0,&{y_{u_{i}}^{T}y_{\mu_{j}}=0}\end{array}\right.\qquad(7)

(7) means: if the learnable class center and the estimated binary center are within the same class, then their similarity is encoded as ‘1’, otherwise ‘0’. Thus, our designed loss function should measure how well each pair of $u_{i}$ and $\mu_{j}$ matches given their similarity $s_{ij}$.

Suppose both $u_{i}$ and $\mu_{j}$ are binary vectors; the Hamming distance between $u_{i}$ and $\mu_{j}$ is then $0.5(l-u_{i}^{T}\mu_{j})$. Let $\theta_{ij}=0.5u_{i}^{T}\mu_{j}$ denote half of the inner product of $u_{i}$ and $\mu_{j}$. Inspired by the widely-used likelihood in pairwise labels-based hashing [13], we present the likelihood of the centers similarity $s_{ij}$ as follows:

p\left(s_{ij}|u_{i};\mu_{j}\right)=\left\{\begin{array}{cc}\rho(\theta_{ij}),&{s_{ij}=1}\\ 1-\rho(\theta_{ij}),&{s_{ij}=0}\end{array}\right.\qquad(8)

where $\rho(\cdot)$ is the sigmoid function. From (8), it is clear that the larger the inner product of $u_{i}$ and $\mu_{j}$, the larger $p(1|u_{i},\mu_{j})$ and the smaller $p(0|u_{i},\mu_{j})$. This matches our expectation that the learnable class center is more likely to cluster intra-class samples while lying far away from the samples belonging to other classes. Note that (8) assumes $\mu_{j}$ is binary, whereas $\mu_{j}$ is a real-valued learnable parameter in the implementation. Therefore, we replace $u_{i}^{T}\mu_{j}$ by an approximation with continuous relaxation. We observe that, under the binary constraint, there is a simple relationship between $u_{i}^{T}\mu_{j}$ and their cosine similarity:

u_{i}^{T}\mu_{j}=l\frac{u_{i}^{T}}{\|u_{i}\|}\frac{\mu_{j}}{\|\mu_{j}\|}=l\cos{\left(u_{i},\mu_{j}\right)}\qquad(9)

where $\|\cdot\|$ is the $\ell_{2}$ norm. Hence, we alternatively adopt $\theta_{ij}=0.5\,l\cos{(u_{i},\mu_{j})}$ in (8).

Based on (8), we present the loss function of the proposed centers similarity learning as the negative log-likelihood:

\begin{split}\min_{M}L_{2}&=-\log\prod_{\substack{i=1,\ldots,Z\\ j=1,\ldots,C}}p\left(s_{ij}|u_{i};\mu_{j}\right)\\ &=-\sum_{s_{ij}\in S}\left(s_{ij}\theta_{ij}-\log(1+e^{\theta_{ij}})\right)\end{split}\qquad(10)
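
As a reference, the sketch below evaluates $L_{2}$ in PyTorch using the cosine relaxation of (9); `u` holds the estimated binary centers $U$, `mu` the learnable centers $M$, and `s` the similarity matrix of (7). The function name is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def centers_similarity_loss(u, mu, s, code_len):
    """Centers similarity loss L2 of Eq. (10) with theta_ij = 0.5 * l * cos(u_i, mu_j).

    u: (Z, l) estimated binary centers, mu: (C, l) learnable real-valued centers,
    s: (Z, C) similarity matrix from Eq. (7), code_len: the code length l.
    """
    cos = F.normalize(u.float(), dim=1) @ F.normalize(mu, dim=1).t()   # (Z, C) cosine similarities
    theta = 0.5 * code_len * cos                                       # relaxed theta_ij
    return (F.softplus(theta) - s * theta).sum()                       # -(s*theta - log(1+e^theta))
```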

By adding (10) as a regularization term to (2), we obtain the following loss function:

\begin{split}\min_{\Theta,M}L_{1}+\gamma L_{2}=&-\sum_{i=1}^{N}\sum_{j=1}^{C}y_{ji}\log\frac{\exp\left\{-\frac{\|h_{i}-\mu_{j}\|^{2}}{2\sigma^{2}}\right\}}{\sum_{c=1}^{C}\exp\left\{-\frac{\|h_{i}-\mu_{c}\|^{2}}{2\sigma^{2}}\right\}}\\ &-\gamma\sum_{s_{ij}\in S}\left(s_{ij}\theta_{ij}-\log(1+e^{\theta_{ij}})\right)\\ &\mathrm{s.t.}\quad h_{i}=b_{i},\quad h_{i}\in\mathbb{R}^{l\times 1},\quad b_{i}\in\{-1,1\}^{l}\end{split}\qquad(11)

where $\gamma$ is a hyper-parameter balancing the classwise loss $L_{1}$ and the centers similarity loss $L_{2}$.

Finally, we need to handle the discrete optimization of $h_{i}$. Following the strategy in [6], we relax $h_{i}$ from discrete to continuous while introducing a quantization error between $h_{i}$ and $b_{i}=\mathbf{sgn}(h_{i})$. By integrating the quantization term, we obtain the final objective function:

\min_{\Theta,M}L=L_{1}+\gamma L_{2}+\beta\sum_{i=1}^{N}\left\|b_{i}-h_{i}\right\|_{2}^{2}\qquad(12)

where $\beta$ is a hyper-parameter controlling the weight of the quantization term.
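
Putting the pieces together, a hedged sketch of the full objective (12) is shown below. It reuses the classwise_loss and centers_similarity_loss sketches above, detaches the sign operation so only the quantization penalty is backpropagated, and takes its default hyperparameter values from Sec. IV ($\sigma^{2}=4$, $\gamma=1$, $\beta=0.01$).

```python
import torch

def total_loss(h, labels, centers, u, s, code_len,
               sigma2=4.0, gamma=1.0, beta=0.01):
    """Combined objective of Eq. (12), built from the sketches above.

    Note: torch.sign maps 0 to 0 while the paper's sgn maps 0 to +1; the
    difference is negligible in practice for this sketch.
    """
    b = torch.sign(h).detach()                    # b_i = sgn(h_i), treated as a constant target
    l1 = classwise_loss(h, labels, centers, sigma2)
    l2 = centers_similarity_loss(u, centers, s, code_len)
    lq = (b - h).pow(2).sum()                     # quantization error ||b_i - h_i||^2
    return l1 + gamma * l2 + beta * lq
```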

III-C Implementation Details and Optimization

Since we train the network in mini-batch iterations, to capture the information of all the intra-class samples in the hashing code space, we dynamically estimate the binary centers $U=\{u_{i}\}_{i=1}^{Z}$ from the current binary sum and the newly sampled data. Recalling (6), we denote the current binary sum of all the intra-class samples belonging to the $j$th class as $\mathbf{sum}_{j}^{(t)}$. Then, for each iteration in which new samples of the $j$th class appear, $u_{j}$ is updated as follows:

\begin{split}\mathbf{sum}_{j}^{(t+1)}&=\mathbf{sum}_{j}^{(t)}+m_{j}^{(t+1)}\\ u_{j}^{(t+1)}&=\mathbf{sgn}(\mathbf{sum}_{j}^{(t+1)})\end{split}\qquad(13)

where $m_{j}^{(t+1)}=\sum_{k=1}^{n_{j}}\mathbf{sgn}(h_{jk})$ represents the sum of the binarized hashing outputs belonging to the $j$th class in the newly arrived mini-batch.
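
A possible realization of the running update in (13) is sketched below; `running_sum` plays the role of $\mathbf{sum}_{j}^{(t)}$ for all classes and, as in Algorithm 1, is reset to zero at the start of every epoch. Names are illustrative.

```python
import torch

def update_binary_centers(running_sum, h, labels):
    """Accumulate bit-wise sums per class and re-estimate binary centers, Eq. (13).

    running_sum: (C, l) accumulated sgn(h) per class, h: (B, l) mini-batch outputs,
    labels: (B,) integer class ids. Returns the updated sums and u in {-1, +1}.
    """
    binary = torch.sign(h).detach()                            # sgn(h_jk) for the new batch
    for c in labels.unique():
        running_sum[c] += binary[labels == c].sum(dim=0)       # sum_j^(t+1) = sum_j^(t) + m_j^(t+1)
    u = torch.where(running_sum >= 0,
                    torch.ones_like(running_sum),
                    -torch.ones_like(running_sum))             # u_j^(t+1) = sgn(sum_j^(t+1))
    return running_sum, u
```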

With each forward pass in the training procedure, we obtain the hashing outputs $\{h_{i}\}$ of the network and the estimated binary centers $U$. Then we take the derivatives of $L$ w.r.t. $h_{i}$ and $\mu_{j}$ individually and update $\mu_{j}$ by gradient descent, as shown below:

\frac{\partial L}{\partial h_{i}}=\frac{\partial L_{1}}{\partial h_{i}}-2\beta(b_{i}-h_{i});\quad\frac{\partial L}{\partial\mu_{j}}=\frac{\partial L_{1}}{\partial\mu_{j}}+\gamma\frac{\partial L_{2}}{\partial\mu_{j}}\qquad(14)

The gradient ${\partial L_{2}}/{\partial\mu_{j}}$ can be derived as:

\frac{\partial L_{2}}{\partial\mu_{j}}=\frac{1}{2}\sum_{i=1}^{Z}u_{i}(r_{ij}-s_{ij})\qquad(15)

where $r_{ij}=\rho(0.5u_{i}^{T}\mu_{j})$. Then, the gradients are backpropagated to the feature learning part to update the network parameters $\Theta$. The overall learning procedure of IDCWH is summarized in Algorithm 1.

Algorithm 1 Learning Algorithm for IDCWH

Input: Training data $X=\{x_{i}\}_{i=1}^{N}$, label matrix $Y\in\mathbb{R}^{C\times{N}}$.
Parameter: Hyperparameters $\sigma^{2}$, $\gamma$ and $\beta$; code length $l$ and batch size $bs$.
Output: Hashing function $f(\cdot)$, class centers $M=\{\mu_{j}\}_{j=1}^{C}$.
Initialize network parameters $\Theta$ by the Glorot distribution and class centers $M$ by a normal distribution; epoch $T=0$.

1:  Repeat
2:     $T=T+1$;
3:     $U=\mathbf{0}$
4:     for $\lfloor\frac{N}{bs}\rfloor$ iterations do
5:        Randomly sample a mini-batch of data $\{x_{i}\}_{i=1}^{bs}$;
6:        Compute $h_{i}=f(\Theta;x_{i})$ and $b_{i}=\mathbf{sgn}(h_{i})$ by forward propagation;
7:        Update $\{u_{k}\}_{k=1}^{Z}$ for each unique class label appearing in the batch by (13);
8:        Compute the loss function $L$ in (12);
9:        Calculate the derivatives of $L$ w.r.t. $h_{i}$ and $\mu_{j}$ individually by (14) and (15);
10:        Update $M$ and $\Theta$ by back propagation;
11:     end for
12:  Until Convergence
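
For concreteness, a condensed PyTorch training loop following Algorithm 1 is sketched below; `net` stands for the backbone plus the FC hashing layer, `centers` for a (C, l) nn.Parameter holding $M$, and the helper functions are the sketches given earlier. Learning rates, schedule and hyperparameters follow Sec. IV-A, but this is an illustrative outline rather than the released implementation.

```python
import torch

def train_idcwh(net, centers, loader, num_classes, code_len,
                epochs=150, sigma2=4.0, gamma=1.0, beta=0.01):
    """Condensed training loop following Algorithm 1 (illustrative sketch)."""
    device = next(net.parameters()).device
    opt = torch.optim.SGD([{"params": net.parameters(), "lr": 1e-2},
                           {"params": [centers], "lr": 5e-3}],
                          momentum=0.9, weight_decay=5e-4)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=50, gamma=0.1)
    for epoch in range(epochs):
        running_sum = torch.zeros(num_classes, code_len, device=device)   # U reset each epoch
        for x, labels in loader:
            x, labels = x.to(device), labels.to(device)
            h = net(x)                                          # h_i = f(Theta; x_i)
            running_sum, u_all = update_binary_centers(running_sum, h, labels)
            present = labels.unique()                           # unique classes in the batch
            u = u_all[present]                                  # estimated binary centers U, (Z, l)
            s = (present.unsqueeze(1) ==
                 torch.arange(num_classes, device=device)).float()        # Eq. (7)
            loss = total_loss(h, labels, centers, u, s, code_len,
                              sigma2, gamma, beta)
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()
    return net, centers
```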

III-D Out-of-Sample Extension

After completing the training procedure, the learned network can generate binary hashing codes for unseen samples. We obtain the binary code $b_{test,i}$ of a given test sample $x_{i}$ by simply making a forward pass through the network and applying the binarization operation, i.e.:

b_{test,i}=\mathbf{sgn}(f(\Theta;x_{i}))\qquad(16)
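
A minimal sketch of this out-of-sample encoding step (illustrative names, assuming the trained network `net` from above):

```python
import torch

@torch.no_grad()
def encode(net, x):
    """Eq. (16): binary codes for unseen samples, b = sgn(f(Theta; x))."""
    return torch.sign(net(x))
```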

IV Experiments

To verify the superiority and effectiveness of the proposed IDCWH method, we conduct extensive experiments on three large-scale benchmark datasets, i.e., CIFAR-10, CIFAR-100 [14] and MS-COCO [15]. All experiments are implemented in PyTorch and run on two Nvidia RTX 2080 Ti GPU cards.

IV-A Experiments Settings

CIFAR-10 [14] is a standard dataset for image recognition with 60,000 images in 10 classes. The official split contains 50,000 training images with 5,000 images per class, and 10,000 testing images with 1,000 per class. We conduct two groups of experiments following different protocols as used in [6, 11, 7]. In the first protocol, we randomly choose 100 images per class (1000 images in total) as query set, and 500 images per class (5000 images in total) as training set. The rest is all used as the database. In the second protocol, all the images in the official training split are used for training and all the images in the test split are used for query.

CIFAR-100 [14] is similar to CIFAR-10 except that it has 100 classes with 600 images per class. We follow the protocol as in DCWH [10], which adopts the official training set and testing set of CIFAR-100. Specifically, we use 100 images per class (10,000 in total) as a query set and 500 images per class (50,000 in total) for training.

MS-COCO [15] is a popular dataset for image recognition, object segmentation and captioning. The official release of ‘train2014’ and ‘val2014’ contains 82,783 training images and 40,504 validation images. Each image is labelled with some of the 80 semantic concepts. Following the protocol in [9], we combine the training and validation images and randomly sample 5,000 images as the query set and 10,000 images as the training set. The remaining images are all used as the database.

We adopt the standard similarity protocol for evaluation as in prior works. Specifically, for the single-label datasets, i.e., CIFAR-10 and CIFAR-100, images from the same class are regarded as similar, otherwise dissimilar. For the multi-label dataset, i.e., MS-COCO, any pair of images sharing at least one common label is regarded as similar, otherwise dissimilar. In addition to Mean Average Precision (MAP), four other commonly-used evaluation metrics are adopted in the experiments: Precision-Recall curves (PR), Precision within Hamming radius 2 w.r.t. different code lengths (P@H=2), Precision curves w.r.t. different numbers of top returned images (P@N) and Recall within Hamming radius 2 w.r.t. different code lengths (R@H=2).
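
For clarity, a simple sketch of MAP by Hamming ranking is given below, assuming single-label ground truth; for the multi-label MS-COCO protocol the relevance test would instead check whether two images share at least one label. Conventions (e.g., truncation at top-k) vary across works, so this is only one plausible variant.

```python
import torch

def mean_average_precision(db_codes, db_labels, q_codes, q_labels, topk=None):
    """MAP by Hamming ranking for single-label ground truth (illustrative sketch).

    db_codes/q_codes: {-1, +1} float tensors of shape (N, l) / (Q, l);
    db_labels/q_labels: integer class labels.
    """
    aps = []
    for q, ql in zip(q_codes, q_labels):
        dist = 0.5 * (db_codes.size(1) - db_codes @ q)     # Hamming distance to the query
        rel = (db_labels[dist.argsort()] == ql).float()    # relevance in ranked order
        if topk is not None:
            rel = rel[:topk]
        if rel.sum() == 0:
            continue
        prec = rel.cumsum(0) / torch.arange(1, rel.numel() + 1)   # precision at each rank
        aps.append((prec * rel).sum() / rel.sum())
    return torch.stack(aps).mean()
```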

We compare the proposed method with the original DCWH [10] and several classical deep supervised hashing methods, including DPSH [6], DTSH [11], DSDH [7] and HashNet [9]. Besides, we also compare with traditional supervised methods, i.e., SDH [4] and FSDH [5], combined with deep feature learning; the two methods are denoted as SDH+CNN and FSDH+CNN. For the feature learning part, we simply employ GoogLeNet [16] with batch normalization [17] as the backbone network, pretrained on ImageNet. Since some other works apply different network architectures from ours, for a fair comparison we re-run the experiments with the code released by the authors and report the performance of both the original works and their counterparts with GoogLeNet. The latter are marked with an asterisk, e.g., DPSH* and DSDH*.

The configuration of the proposed IDCWH is as follows: we finetune the pretrained GoogLeNet from the input to the last layer before the fully-connected layer and train the FC hashing layer (as shown in Fig. 2) for 150 epochs. We use mini-batch stochastic gradient descent (SGD) with momentum 0.9 and weight decay 5e-4. The initial learning rates are 1e-2 and 5e-3 for the feature learning part and the centers learning, respectively, and the learning rates are decayed by 0.1 every 50 epochs. The three hyperparameters, selected by cross-validation, are $\sigma^{2}=4$, $\gamma=1$ and $\beta=0.01$. The batch size is fixed to 128 in all the experiments.

TABLE I: MAP results by Hamming ranking on CIFAR-10 under two experiment protocols
Method CIFAR-10 (mini) CIFAR-10 (full)
12 bits 24 bits 32 bits 48 bits 12 bits 24 bits 32 bits 48 bits
SDH+CNN 0.207 0.218 0.223 0.210 0.364 0.433 0.405 0.414
FSDH+CNN 0.196 0.220 0.203 0.212 0.374 0.443 0.410 0.446
DPSH 0.713 0.727 0.744 0.757 0.763 0.781 0.795 0.807
DPSH* 0.797 0.806 0.820 0.802 0.908 0.922 0.925 0.935
DTSH 0.710 0.750 0.765 0.774 0.915 0.923 0.925 0.926
DTSH* 0.790 0.797 0.794 0.775 0.928 0.935 0.940 0.942
DSDH 0.740 0.786 0.801 0.820 0.935 0.940 0.939 0.939
DSDH* 0.800 0.802 0.804 0.808 0.913 0.925 0.943 0.930
DCWH 0.818 0.840 0.848 0.850 0.940 0.950 0.954 0.952
IDCWH 0.828 0.865 0.868 0.859 0.964 0.969 0.967 0.968
Refer to caption
Figure 3: Results on CIFAR-10 dataset. Left: P@H=2 w.r.t. different code lengths; Middle: P@N under 32-bit codes; Right: R@H=2 w.r.t. different code lengths.

IV-B Results

We first present the evaluation on the CIFAR-10 dataset. We denote the first and second experimental protocols as CIFAR-10 (mini) and CIFAR-10 (full), respectively. The MAP results of the compared methods are shown in Table I. We can see that the proposed IDCWH outperforms the other methods in almost all cases of the two protocols. Among them, the classwise-based deep hashing methods perform better than the other pairwise/triplet labels-based methods, and IDCWH surpasses DCWH by 1-2%. Specifically, under the first protocol, it achieves the best MAP performance of 86.8% with 32-bit codes, and the best MAP result under the second protocol is 96.9% with 24-bit codes. One can find that DPSH* and DTSH*, which employ the GoogLeNet backbone, perform significantly better than the original works with their own networks, while the performance of the methods combining traditional supervised hashing with deep feature learning, i.e., SDH+CNN and FSDH+CNN, is rather poor. This verifies the superiority of the end-to-end training manner of deep supervised hashing. One can also observe that, by training with more samples, the performance of all methods under the second protocol is greatly improved compared to the first protocol.

The performance on CIFAR-10 under other evaluation metrics is shown in Fig. 3. It is clear that the proposed IDCWH is robust, as it achieves the best result under various evaluation metrics. From Fig. 3, one can observe two common phenomena across all the compared methods. The first is that the P@H=2 curve always first increases and then drops as the code length grows. The second is that the overall trend of R@H=2 decreases as the code length increases. The reason may be that, as the dimension of the Hamming space grows, the learned binary codes become sparser, which leads to fewer data points falling within Hamming radius 2.

TABLE II: MAP results by Hamming ranking on CIFAR-100 dataset
Method CIFAR-100
12 bits 24 bits 32 bits 48 bits
SDH+CNN 0.0617 0.0624 0.0610 0.0668
FSDH+CNN 0.0596 0.0618 0.0650 0.0665
DPSH 0.0597 0.1008 0.1196 0.1587
DTSH 0.6070 0.7056 0.7122 0.7252
DSDH 0.0784 0.1495 0.1868 0.2272
DCWH 0.7227 0.7441 0.7570 0.7658
IDCWH 0.7642 0.8130 0.8236 0.8351

We further present the experimental results on the CIFAR-100 dataset. The MAP performance on CIFAR-100 is presented in Table II. We can see that the proposed IDCWH achieves the best results under all the compared code lengths and surpasses the second-best method, i.e., DCWH, by a margin of at least 4%. Specifically, it improves the MAP scores to more than 80% for code lengths of at least 24 bits, which is nearly 7% higher than DCWH in each case, while for 12-bit codes it outperforms the other methods with an MAP of 76.42%.

For the other pairwise/triplet labels-based methods, the performance on CIFAR-100 is much poorer than that on CIFAR-10. However, IDCWH and DCWH, both of which are based on the classwise loss, can still achieve promising results. Considering the larger number of classes and the smaller number of training images per class in CIFAR-100, this verifies the effectiveness of classwise hashing, which can better capture the semantic information of a large-scale dataset with small inter-class variations and significant intra-class variations. The performance on other evaluation metrics, i.e., P@H=2, P@N and PR curves, is illustrated in Fig. 4. We can see that the proposed IDCWH again performs the best under the various evaluation metrics. Observe that there is a drastic drop in the P@N curve at N=500, which corresponds to the fact that the number of training images per class in CIFAR-100 is 500. One can also find that, among all the pairwise/triplet labels-based methods, DTSH performs much better than DPSH and DSDH.

Finally, we summarize the evaluation on the multi-label MS-COCO dataset. The results of HashNet, DHN [18] and DNNH [19] are added, which are taken directly from [9], and we also run experiments with the DSDH and DPSH methods. The MAP results of all the compared methods on MS-COCO are presented in Table III. We can see that IDCWH performs slightly better than DCWH under each code length and achieves the highest MAP scores among all the methods. Recall the formulation of IDCWH: it treats each class center independently as a learnable parameter instead of updating the center manually. Thus, it can be feasibly extended to multi-label cases without computing the average of every new combination of class centers as in DCWH. One can also observe that the performance of DPSH and DSDH is poor on MS-COCO. This again validates our expectation that leveraging the classwise differences among all samples helps to generate more semantics-preserving binary codes.

Refer to caption
Figure 4: Results on CIFAR-100 dataset. Left: P@H=2 w.r.t. different code lengths; Middle: P@N with 48-bit codes; Right: PR curve with 48-bit codes.
TABLE III: MAP results by Hamming ranking on MS-COCO dataset
Method MS-COCO
16 bits 32 bits 48 bits 64 bits
DPSH 0.3493 0.3545 0.3595 0.3670
DSDH 0.3470 0.3587 0.3661 0.3703
DNNH 0.5932 0.6034 0.6045 0.6099
DHN 0.6774 0.7013 0.6948 0.6944
HashNet 0.6873 0.7184 0.7301 0.7362
DCWH 0.7227 0.7441 0.7570 0.7658
IDCWH 0.7321 0.7597 0.7636 0.7698

IV-C Empirical Analysis

IV-C1 Visualization of Hashing Codes

To intuitively compare the capability of different methods, we visualize the 48-bit hashing codes generated by DTSH, DCWH and IDCWH on the CIFAR-100 dataset. For a more comprehensive comparison, we also visualize the original 1024-dimensional features extracted from the deep CNN. We first randomly sample 10 of the 100 categories in CIFAR-100, then generate the binary codes of all the testing images belonging to these 10 categories. Finally, t-SNE [20] is used to map the 48-bit codes / 1024-dimensional features to 2-dimensional features for visualization. The results are shown in Fig. 5. We observe that the hashing codes generated by IDCWH are the most compact and discriminative, while the codes generated by DTSH and DCWH are not as well separated; this is also reflected in the ratio of within-class to between-class distances of each method.
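
A sketch of this visualization procedure is shown below; `net`, `loader` and `classes_to_show` are illustrative names, the model and loader tensors are assumed to be on the same device, and scikit-learn's TSNE with matplotlib stands in for the plotting pipeline.

```python
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

@torch.no_grad()
def visualize_codes(net, loader, classes_to_show):
    """t-SNE view of the learned binary codes, mirroring the Fig. 5 procedure (a sketch)."""
    keep_set = torch.tensor(classes_to_show)
    codes, labels = [], []
    for x, y in loader:
        mask = (y.unsqueeze(1) == keep_set).any(dim=1)     # keep only the sampled categories
        if mask.any():
            codes.append(torch.sign(net(x[mask])).cpu())
            labels.append(y[mask].cpu())
    codes = torch.cat(codes).numpy()
    labels = torch.cat(labels).numpy()
    emb = TSNE(n_components=2).fit_transform(codes)        # 48-bit codes -> 2-D embedding
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=4, cmap="tab10")
    plt.savefig("tsne_codes.png", dpi=150)
```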

Refer to caption
Figure 5: The t-SNE visualization of hashing codes learned from deep features, DTSH, DCWH and the proposed IDCWH, respectively. $d_{w}$ and $d_{b}$ represent the within-class and between-class distances.

IV-C2 Ablation Study

We conduct experiments to investigate the effect of the proposed centers similarity learning. Denote the final loss function without the $L_{2}$ term in (10) as IDCWH-Single. The performance comparison of the two configurations on the three datasets is shown in Table IV. We can see that IDCWH steadily performs better than IDCWH-Single in all cases. This validates the functionality of centers similarity learning, which helps each class center concentrate more on its intra-class samples while pushing other class centers as far away as possible. Another observation from Table IV is that the contribution of $L_{2}$ is more distinct for shorter code lengths than for relatively longer ones.

TABLE IV: Comparison on MAP with single classwise loss and the combination with centers similarity loss
Method CIFAR-10 (mini) CIFAR-100 MS-COCO
12-bit 24-bit 32-bit 48-bit 12-bit 24-bit 32-bit 48-bit 16-bit 32-bit 48-bit 64-bit
IDCWH-Single 0.7944 0.8239 0.8416 0.8420 0.7015 0.7393 0.7638 0.7701 0.6963 0.7125 0.7490 0.7529
IDCWH 0.8284 0.8650 0.8681 0.8586 0.7642 0.8130 0.8236 0.8351 0.7321 0.7597 0.7636 0.7698

IV-C3 Parameters Sensitivity Analysis

We further investigate the influence of the hyperparameters on IDCWH. We focus on two hyperparameters in our work: $\sigma^{2}$, which controls the variance of the intra-class samples around the corresponding class center, and $\beta$, which penalizes the quantization error. Specifically, we test $\sigma^{2}$ over $\{0.5,1,2,4,8\}$ and $\beta$ over $\{0.001,0.01,0.1\}$. The experiments are conducted on the CIFAR-100 dataset and the results are illustrated in Fig. 6. From Fig. 6, we observe that the performance under both $\sigma^{2}$ and $\beta$ shares a similar pattern as their values grow, i.e., it first rises and then descends. Thus, we choose $\sigma^{2}=4$ and $\beta=0.01$ in this paper as a promising configuration.

Refer to caption
Figure 6: Parameters sensitivity analysis. Left: MAP w.r.t. various $\sigma^{2}$ with 48-bit and 12-bit codes, respectively. Right: PR curve w.r.t. different values of $\beta$ with 48-bit codes.

V Conclusion

This paper addressed a limitation of prior works on deep supervised hashing and proposed an improved framework with centers similarity learning. To achieve this, it first clusters intra-class samples around the learnable class centers and dynamically estimates binary centers within each iteration. To enhance the discriminability of the hashing codes, it then minimizes the Hamming distance from the learnable class center to the corresponding estimated binary center while maximizing the distance to the other estimated centers. Experiments on three benchmark datasets verify that the proposed method effectively boosts the original deep classwise hashing and yields state-of-the-art retrieval results.

Acknowledgment

This work is supported by Hong Kong Research Grants Council (Project C1007-15G) and City University of Hong Kong (Projects 9610034 and 9610460).

References

  • [1] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Advances in neural information processing systems, 2009, pp. 1753–1760.
  • [2] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, “Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2916–2929, 2012.
  • [3] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, “Supervised hashing with kernels,” in Proceedings of the IEEE conference on computer vision and pattern recognition.   IEEE, 2012, pp. 2074–2081.
  • [4] F. Shen, C. Shen, W. Liu, and H. Tao Shen, “Supervised discrete hashing,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 37–45.
  • [5] J. Gui, T. Liu, Z. Sun, D. Tao, and T. Tan, “Fast supervised discrete hashing,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 2, pp. 490–496, 2017.
  • [6] W.-J. Li, S. Wang, and W.-C. Kang, “Feature learning based deep supervised hashing with pairwise labels,” in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence.   AAAI Press, 2016, pp. 1711–1717.
  • [7] Q. Li, Z. Sun, R. He, and T. Tan, “Deep supervised discrete hashing,” in Advances in neural information processing systems, 2017, pp. 2482–2491.
  • [8] Y. Chen, Z. Lai, Y. Ding, K. Lin, and W. K. Wong, “Deep supervised hashing with anchor graph,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9796–9804.
  • [9] Z. Cao, M. Long, J. Wang, and P. S. Yu, “Hashnet: Deep learning to hash by continuation,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 5608–5617.
  • [10] X. Zhe, S. Chen, and H. Yan, “Deep class-wise hashing: Semantics-preserving hashing via class-wise loss,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 5, pp. 1681–1695, 2020.
  • [11] X. Wang, Y. Shi, and K. M. Kitani, “Deep supervised hashing with triplet labels,” in Asian conference on computer vision.   Springer, 2016, pp. 70–84.
  • [12] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese, “Deep metric learning via lifted structured feature embedding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4004–4012.
  • [13] P. Zhang, W. Zhang, W.-J. Li, and M. Guo, “Supervised hashing with latent factor models,” in Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, 2014, pp. 173–182.
  • [14] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
  • [15] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision.   Springer, 2014, pp. 740–755.
  • [16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
  • [17] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
  • [18] H. Zhu, M. Long, J. Wang, and Y. Cao, “Deep hashing network for efficient similarity retrieval,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [19] H. Lai, Y. Pan, Y. Liu, and S. Yan, “Simultaneous feature learning and hash coding with deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3270–3278.
  • [20] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.