
Improved Deep Classwise Hashing With Centers Similarity Learning for Image Retrieval

Ming Zhang Department of Electrical Engineering
City University of Hong Kong
Hong Kong, China
[email protected]
   Hong Yan Department of Electrical Engineering
City University of Hong Kong
Hong Kong, China
[email protected]
Abstract

Deep supervised hashing for image retrieval has attracted researchers’ attention due to its high efficiency and superior retrieval performance. Most existing deep supervised hashing works, which are based on pairwise/triplet labels, suffer from expensive computational cost and insufficient utilization of semantic information. Recently, deep classwise hashing introduced an alternative classwise loss supervised by class label information; however, we find it still has a drawback. In this paper, we propose an improved deep classwise hashing, which enables hashing learning and class centers learning simultaneously. Specifically, we design a two-step strategy for centers similarity learning. It interacts with the classwise loss to attract each class center toward its intra-class samples while pushing other class centers as far away as possible. The centers similarity learning contributes to generating more compact and discriminative hashing codes. Experiments on three benchmark datasets show that the proposed method effectively surpasses the original method and outperforms state-of-the-art baselines under various commonly-used evaluation metrics for image retrieval.

Index Terms:
image retrieval, deep hashing, convolutional neural networks

I Introduction

In the big data era, there is an explosive growth of data uploaded to the Internet every day. How to search high-dimensional data in a large-scale database both efficiently and effectively becomes increasingly important for various social platforms and search engines. Given a query image, the task of content-based image retrieval (CBIR) is to find all items in the database related to the query according to their visual features. CBIR has witnessed great progress in the last decades. Suppose both the database images and the query image are encoded as real-valued features; the closest items to the query are then found by ranking the database items by their distance to the query in the feature space. However, for a database with millions of images, the computation and storage costs of real-valued features are expensive. Moreover, this places a heavy burden on mobile or embedded devices, where storage is limited and low latency is required.

As a powerful approximate nearest neighbor search technique, hashing has been widely employed for information retrieval. The goal of hashing is to map high-dimensional data into lower-dimensional binary codes while preserving the original similarity relations of the input space. Performed with binary codes, retrieval enjoys extremely fast speed and low storage cost. Generally, existing data-dependent hashing methods can be categorized into unsupervised hashing [1, 2] and supervised hashing [3, 4, 5]. Unsupervised hashing methods learn hashing functions under different distance measurements without considering label information, whereas supervised hashing incorporates label information, e.g., pairwise labels, into the hashing learning process and is known to outperform unsupervised hashing at the same code length.

With the great success of deep learning in the past years, deep supervised hashing [6, 7, 8, 9, 10] applies deep neural networks to extract feature representations for hashing function learning. Among these methods, works trained in an end-to-end manner have greatly improved retrieval performance and shown superior results to traditional supervised hashing methods. Most existing deep supervised hashing utilizes pairwise label information to examine the similarity between pairs of samples. A representative work is Deep Pairwise Supervised Hashing (DPSH) [6], which minimizes the Hamming distance between each pair of similar samples while maximizing the Hamming distance between each pair of dissimilar samples. Similar samples refer to samples sharing at least one class, and their similarity is denoted as ‘1’; dissimilar samples share no common object class, and their similarity is denoted as ‘0’ or ‘-1’. Inspired by DPSH, several variations have been proposed, e.g., Deep Supervised Hashing with Triplet labels (DTSH) [11] and Deep Supervised Discrete Hashing (DSDH) [7].

The main problem of pairwise/triplet labels-based deep supervised hashing is the tremendous computation cost. The time complexity of the algorithm is $O(n^{2})$, where $n$ is the number of samples in the training set. In practice, this is handled by mini-batch sampling of inputs during training. However, it brings a side effect: the labeled data cannot be fully utilized because some supervised information is always discarded during iterative learning. Consequently, the learned binary hashing codes are suboptimal for preserving the semantic relations of the dataset. Some works have been proposed to overcome this issue, e.g., DAGH [8] utilizes an anchor graph to efficiently connect all the training samples and the anchors under the deep hashing framework.

Recently, some novel deep hashing methods propose to directly use label information as supervision. One typical example is Deep Classwise Hashing (DCWH) [10], which applies a classwise loss based on a normalized Gaussian distribution. DCWH introduces class centers into the learning procedure, with the essence of minimizing the Hamming distance between intra-class samples and the corresponding class center. Experiments show that DCWH outperforms prior pairwise/triplet labels-based deep hashing works, especially on datasets with a large number of classes and small inter-class variations. Although DCWH guarantees that the average distance of intra-class samples to the corresponding class center is smaller than that to other centers, it still has a drawback. Concretely, when data points belonging to different classes lie far from their corresponding class centers, they are prone to be closer to each other than to their intra-class samples. We will demonstrate this issue in detail in the next section. To overcome it, we propose an improved deep classwise hashing method (IDCWH) with centers similarity learning to generate more compact and discriminative binary codes for image retrieval. The contributions of this paper are summarized as follows:

  • The proposed IDCWH enables networks to learn hashing codes and class centers simultaneously in an end-to-end manner. The two parts of the designed loss mutually contribute to concentrating intra-class samples on the corresponding class center and lead to better training convergence.

  • We introduce a two-step strategy for centers similarity learning. It first estimates a binary center for each unique class from the mini-batch data to represent the intra-class samples. Then, it dynamically attracts the corresponding learnable center to be close to the estimated center while pushing other centers away in the Hamming space.

  • Extensive experiments on three benchmark datasets show that the proposed IDCWH outperforms the compared state-of-the-art baselines, verifying its effectiveness in generating compact and semantics-preserving hashing codes for image retrieval.

II Related Works

Denote the dataset with $N$ training samples as $X=\{x_{i}\}_{i=1}^{N}$. The label matrix of $X$ is given as $Y=[y_{1},y_{2},\ldots,y_{N}]\in\mathbb{R}^{C\times{N}}$, where $C$ is the number of unique classes in the dataset and $y_{ji}=1$ if $x_{i}$ owns the label of the $j$th class, otherwise $y_{ji}=0$. Note that $y_{i}$ is not limited to a one-hot encoding vector, in order to handle multi-label cases. The goal of supervised hashing is to learn hashing functions that map $X$ into binary codes $B=\{b_{i}\}_{i=1}^{N}\in\mathbb{R}^{l\times{N}}$ with the supervision information in $Y$, where $b_{i}\in\{-1,1\}^{l}$ and $l$ is the code length. Nowadays, deep supervised hashing [7, 6, 10, 8] applies deep neural networks $f(\cdot)$ to achieve feature learning and hashing learning jointly. Thus, $b_{i}=f(\Theta;x_{i})$, where $\Theta$ represents the learnable parameters of the network.

II-A Deep Supervised Hashing with Pairwise Labels

Given the label matrix $Y$, the pairwise similarity between any pair of samples $x_{i}$ and $x_{j}$ in $X$ can be represented as $s_{ij}$, where $s_{ij}=y_{i}^{T}y_{j}$. Thus, $s_{ij}=1$ if $x_{i}$ and $x_{j}$ share at least one common label, and $s_{ij}=0$ otherwise. Most deep supervised hashing is based on pairwise/triplet labels [6, 7], which learns $f(\cdot)$ to map $X$ from the original space into $B$ in the Hamming space while maintaining the semantic similarity in $s_{ij}$. The loss function $L_{p}$ preserving the pairwise similarity is summarized as follows:

\min_{\Theta}L_{p}=\sum_{\substack{i=1\\ i\neq j}}^{N}\sum_{j=1}^{N}L\left(f(\Theta;x_{i}),f(\Theta;x_{j});s_{ij}\right)\qquad(1)

The above loss function aims to minimize the Hamming distance between any pair of similar samples involving $x_{i}$ while maximizing the distance between any pair of dissimilar samples involving $x_{i}$. Owing to the expensive computation cost of (1), pairwise-based deep supervised hashing usually samples a portion of data from the whole dataset for training. The insufficient utilization of the supervised information makes the learned binary codes $B$ less discriminative for image retrieval.
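
To make this construction concrete, below is a minimal PyTorch sketch of a DPSH-style pairwise negative log-likelihood over a mini-batch; the function name and the scaling $\theta_{ij}=0.5\,h_{i}^{T}h_{j}$ are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def pairwise_nll(h, s):
    """DPSH-style pairwise loss on a mini-batch (a sketch, not the official code).

    h: (B, l) real-valued hashing outputs of the network.
    s: (B, B) similarity matrix, s[i, j] = 1 iff samples i and j share a label.
    """
    theta = 0.5 * h @ h.t()                          # theta_ij = 0.5 * h_i^T h_j
    nll = F.softplus(theta) - s * theta              # log(1 + e^theta) - s_ij * theta_ij
    mask = ~torch.eye(h.size(0), dtype=torch.bool)   # drop the i == j terms
    return nll[mask].mean()
```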

Refer to caption
Figure 1: A toy example demonstrating the drawback of the original DCWH (a) and its improvement (b). Two classes of data points are sampled from two independent Gaussian distributions with the same variance in every case. The centers of the two classes are located at (0,1) and (1,1), respectively, in (a), and at (0,1) and (1,0) in (b). The x- and y-axes represent the coordinates of the data points, while the z-axis represents the probability density.

II-B Deep Supervised Hashing with Classwise Loss

Inspired by deep metric learning [12], DCWH [10] first applies a classwise loss for hashing learning, which directly uses the label information as supervision. The classwise loss can be formulated as a cross-entropy loss based on a normalized Gaussian distribution:

\begin{split}\min_{\Theta,M}L_{1}&=-\sum_{i=1}^{N}\sum_{j=1}^{C}y_{ji}\log{p\left(y_{ji}=1|x_{i};\Theta;M\right)}\\ &=-\sum_{i=1}^{N}\sum_{j=1}^{C}y_{ji}\log\frac{\exp\left\{-\frac{\|h_{i}-\mu_{j}\|^{2}}{2\sigma^{2}}\right\}}{\sum_{c=1}^{C}\exp\left\{-\frac{\|h_{i}-\mu_{c}\|^{2}}{2\sigma^{2}}\right\}}\end{split}\qquad(2)

where $h_{i}=f(\Theta;x_{i})$ is the output of the feature learning part, $M=\{\mu_{j}\}_{j=1}^{C}$ are the class centers, and $\sigma^{2}$ is a hyperparameter controlling the class variance. Note that the scale parameter $\frac{1}{\sqrt{2\pi}\sigma}$ is omitted for clarity, since it does not affect the final result. $M$ is updated periodically as $\mu_{j}=\left(\sum_{i=1}^{N}y_{ji}h_{i}\right)/n_{j}$, where $n_{j}$ is the number of samples with the $j$th class label. By introducing class centers, (2) aims to maximize the probability of assigning intra-class samples to the corresponding class center. However, it cannot guarantee that samples belonging to different classes but lying far from the means of their respective Gaussian distributions remain closer to their intra-class samples than to each other. An illustrative example is shown in Fig. 1. We can see that increasing the Hamming distance between different class centers helps to avoid mis-retrieving data points lying on the boundary. We elaborate on how to improve this centers discriminability in the next section.
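
For reference, a minimal PyTorch sketch of the classwise loss in (2) is given below, assuming single-label integer targets (the multi-label case would instead weight the log-probabilities by $y_{ji}$); names and defaults are illustrative.

```python
import torch
import torch.nn.functional as F

def classwise_loss(h, labels, centers, sigma2=4.0):
    """Classwise loss of Eq. (2): softmax over negative squared distances to the class centers.

    h: (B, l) hashing outputs, labels: (B,) integer class ids,
    centers: (C, l) class centers M, sigma2: the class-variance hyperparameter.
    """
    d2 = torch.cdist(h, centers).pow(2)       # (B, C) squared Euclidean distances
    logits = -d2 / (2.0 * sigma2)             # unnormalized log-probabilities
    return F.cross_entropy(logits, labels)    # -log softmax probability of the true class
```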

Refer to caption
Figure 2: Illustration of IDCWH. It learns the network parameters $\Theta$ in the feature learning part and the class centers $M=\{\mu_{j}\}$ in the dual loss part. The proposed centers similarity learning first estimates binary centers $U$; the centers similarity can then be indicated by the inner product between $U$ and $M$ under the supervision of $S$.

III Improving Classwise Hashing with Centers Similarity Learning

III-A Problem Definition

To alleviate the aforementioned issue in DCWH, the proposed IDCWH introduces a novel loss function on centers similarity. For simplicity, we adopt the formulation of the original classwise loss in (2) as the base loss and regard the newly designed loss as a regularization term. Note that we now treat the class centers as learnable parameters instead of updating them manually and periodically. IDCWH provides an end-to-end framework combining efficient hashing codes and class centers learning. Specifically, we aim to learn the network parameters $\Theta$ in the hashing learning part and the class centers $M=\{\mu_{j}\}_{j=1}^{C}$ in the loss layer. The proposed centers similarity learning has two functionalities: firstly, it dynamically attracts the class centers to concentrate on intra-class samples, which interacts with the original classwise loss to minimize the sample-to-center distance. Secondly, it maximizes the distance between intra-class samples and the centers belonging to other classes for discriminative retrieval. The overview of IDCWH is shown in Fig. 2.

III-B Two-step Centers Similarity Learning

III-B1 Intra-class Samples Clustering

One straightforward way to improve the discriminability of the learnable centers is to enlarge the Hamming distance between pairwise class centers. However, this may cause a problem: the learned class centers may fail to follow the distribution of the hashing outputs. Consequently, the training procedure may become unstable and hard to converge. To tackle this issue, we instead first estimate a center with a binary constraint, which clusters the intra-class data points. Denoting the approximated center of the $j$th class as $u_{j}$, we have the following objective function:

\min_{u_{j}}\sum_{i=1}^{N}\sum_{j=1}^{C}y_{ji}{\left\|u_{j}-b_{i}\right\|}_{2}^{2}\quad\mathrm{s.t.}\;u_{j}\in\{-1,1\}^{l}\qquad(3)

(3) can be reduced to the following formulation, which clusters intra-class samples to each approximated center individually:

\min_{u_{j}}\sum_{k=1}^{n_{j}}{\left\|u_{j}-b_{jk}\right\|}_{2}^{2}=\sum_{k=1}^{n_{j}}u_{j}^{T}u_{j}+\sum_{k=1}^{n_{j}}b_{jk}^{T}b_{jk}-2u_{j}^{T}\sum_{k=1}^{n_{j}}b_{jk}\qquad(4)

where $n_{j}$ is the number of samples belonging to the $j$th class and $b_{jk}$ represents the $k$th binary code with class label $j$. Since both $u_{j}^{T}u_{j}$ and $b_{jk}^{T}b_{jk}$ are constants under the binary constraint, (4) can be simplified as:

\max_{u_{j}}u_{j}^{T}\sum_{k=1}^{n_{j}}b_{jk}\quad\mathrm{s.t.}\;u_{j}\in\{-1,1\}^{l}\qquad(5)

Denoting the sum of all the intra-class binary codes $\sum_{k=1}^{n_{j}}b_{jk}$ as $m_{j}$, the solution of (5) has the closed form:

u_{j,v}=\mathbf{sgn}(m_{j,v})=\left\{\begin{array}{cc}1,&{m_{j,v}\geq 0}\\ -1,&{m_{j,v}<0}\end{array}\right.\qquad(6)

where $\mathbf{sgn}(\cdot)$ is the binarization function and $v=1,2,\ldots,l$ indexes the vector entries. (6) amounts to voting for every bit of the binary center among the samples of the same class.
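
A possible PyTorch realization of this bit-wise voting is sketched below; the function name is illustrative and ties ($m_{j,v}=0$) are mapped to +1 following the convention in (6).

```python
import torch

def estimate_binary_center(b):
    """Estimate the binary center of one class by bit-wise voting, Eq. (6).

    b: (n_j, l) binarized codes in {-1, +1} of the intra-class samples.
    Returns an (l,) vector in {-1, +1}.
    """
    m = b.sum(dim=0)                                  # m_j: bit-wise sum over the class
    return torch.where(m >= 0, torch.ones_like(m), -torch.ones_like(m))
```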

III-B2 Centers Concentration and Repelling

After obtaining $u_{j}$ for the intra-class samples, the second step of the proposed method aims to minimize the Hamming distance from $u_{j}$ to the learnable class center $\mu_{j}$ while maximizing the distance from $u_{j}$ to the other learnable class centers. Since the training process runs in a mini-batch manner, we denote the estimated binary centers of all the unique labels appearing in the mini-batch data as $U=\{u_{i}\}_{i=1}^{Z}$. Then, we can build a similarity matrix $S\in\mathbb{R}^{Z\times{C}}$ to encode the similarity relations between $U$ and $M$. $S$ is defined as follows:

s_{ij}=\left\{\begin{array}{cc}1,&{y_{u_{i}}^{T}y_{\mu_{j}}=1}\\ 0,&{y_{u_{i}}^{T}y_{\mu_{j}}=0}\end{array}\right.\qquad(7)

(7) means: if the learnable class center and the estimated binary center are within the same class, then their similarity is encoded as ‘1’, otherwise ‘0’. Thus, our designed loss function should measure how well each pair of $u_{i}$ and $\mu_{j}$ matches given their similarity $s_{ij}$.

Suppose both $u_{i}$ and $\mu_{j}$ are binary vectors; the Hamming distance between $u_{i}$ and $\mu_{j}$ is then $0.5(l-u_{i}^{T}\mu_{j})$. Let $\theta_{ij}=0.5u_{i}^{T}\mu_{j}$ denote half of the inner product of $u_{i}$ and $\mu_{j}$. Inspired by the widely-used likelihood in pairwise labels-based hashing [13], we present the likelihood of the centers similarity $s_{ij}$ as follows:

p\left(s_{ij}|u_{i};\mu_{j}\right)=\left\{\begin{array}{cc}\rho(\theta_{ij}),&{s_{ij}=1}\\ 1-\rho(\theta_{ij}),&{s_{ij}=0}\end{array}\right.\qquad(8)

where $\rho(\cdot)$ is the sigmoid function. From (8), it is clear that the larger the inner product of $u_{i}$ and $\mu_{j}$, the larger $p(1|u_{i},\mu_{j})$ and the smaller $p(0|u_{i},\mu_{j})$. This matches our expectation that the learnable class center is more likely to cluster intra-class samples while lying far away from the samples belonging to other classes. Note that (8) assumes $\mu_{j}$ is binary, whereas $\mu_{j}$ is a real-valued learnable parameter in the implementation. Therefore, we replace $u_{i}^{T}\mu_{j}$ by an approximation with continuous relaxation. We observe that, under the binary constraint, there is a simple relationship between $u_{i}^{T}\mu_{j}$ and their cosine similarity:

u_{i}^{T}\mu_{j}=l\frac{u_{i}^{T}}{\|u_{i}\|}\frac{\mu_{j}}{\|\mu_{j}\|}=l\cos{\left(u_{i},\mu_{j}\right)}\qquad(9)

where $\|\cdot\|$ is the $\ell_{2}$ norm. Hence, we alternatively adopt $\theta_{ij}=0.5\,l\cos{(u_{i},\mu_{j})}$ in (8).

Based on (8), we present the loss function of the proposed centers similarity learning as the negative log-likelihood:

\begin{split}\min_{M}L_{2}&=-\log\prod_{\substack{i=1,\ldots,Z\\ j=1,\ldots,C}}p\left(s_{ij}|u_{i};\mu_{j}\right)\\ &=-\sum_{s_{ij}\in S}\left(s_{ij}\theta_{ij}-\log(1+e^{\theta_{ij}})\right)\end{split}\qquad(10)
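
As a reference, the sketch below evaluates $L_{2}$ in PyTorch using the cosine relaxation of (9); `u` holds the estimated binary centers $U$, `mu` the learnable centers $M$, and `s` the similarity matrix of (7). The function name is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def centers_similarity_loss(u, mu, s, code_len):
    """Centers similarity loss L2 of Eq. (10) with theta_ij = 0.5 * l * cos(u_i, mu_j).

    u: (Z, l) estimated binary centers, mu: (C, l) learnable real-valued centers,
    s: (Z, C) similarity matrix from Eq. (7), code_len: the code length l.
    """
    cos = F.normalize(u.float(), dim=1) @ F.normalize(mu, dim=1).t()   # (Z, C) cosine similarities
    theta = 0.5 * code_len * cos                                       # relaxed theta_ij
    return (F.softplus(theta) - s * theta).sum()                       # -(s*theta - log(1+e^theta))
```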

By adding (10) as a regularization term to (2), we obtain the following loss function:

\begin{split}\min_{\Theta,M}L_{1}+\gamma L_{2}=&-\sum_{i=1}^{N}\sum_{j=1}^{C}y_{ji}\log\frac{\exp\left\{-\frac{\|h_{i}-\mu_{j}\|^{2}}{2\sigma^{2}}\right\}}{\sum_{c=1}^{C}\exp\left\{-\frac{\|h_{i}-\mu_{c}\|^{2}}{2\sigma^{2}}\right\}}\\ &-\gamma\sum_{s_{ij}\in S}\left(s_{ij}\theta_{ij}-\log(1+e^{\theta_{ij}})\right)\\ &\mathrm{s.t.}\quad h_{i}=b_{i},\quad h_{i}\in\mathbb{R}^{l\times 1},\quad b_{i}\in\{-1,1\}^{l}\end{split}\qquad(11)

where $\gamma$ is a hyper-parameter balancing the classwise loss $L_{1}$ and the centers similarity loss $L_{2}$.

Finally, we need to handle the discrete optimization of $h_{i}$. Following the strategy in [6], we relax $h_{i}$ from discrete to continuous while introducing a quantization error between $h_{i}$ and $b_{i}=\mathbf{sgn}(h_{i})$. By integrating the quantization term, we obtain the final objective function:

\min_{\Theta,M}L=L_{1}+\gamma L_{2}+\beta\sum_{i=1}^{N}\left\|b_{i}-h_{i}\right\|_{2}^{2}\qquad(12)

where $\beta$ is a hyper-parameter controlling the weight of the quantization term.
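
Putting the pieces together, a hedged sketch of the full objective (12) is shown below. It reuses the classwise_loss and centers_similarity_loss sketches above, detaches the sign operation so only the quantization penalty is backpropagated, and takes its default hyperparameter values from Sec. IV ($\sigma^{2}=4$, $\gamma=1$, $\beta=0.01$).

```python
import torch

def total_loss(h, labels, centers, u, s, code_len,
               sigma2=4.0, gamma=1.0, beta=0.01):
    """Combined objective of Eq. (12), built from the sketches above.

    Note: torch.sign maps 0 to 0 while the paper's sgn maps 0 to +1; the
    difference is negligible in practice for this sketch.
    """
    b = torch.sign(h).detach()                    # b_i = sgn(h_i), treated as a constant target
    l1 = classwise_loss(h, labels, centers, sigma2)
    l2 = centers_similarity_loss(u, centers, s, code_len)
    lq = (b - h).pow(2).sum()                     # quantization error ||b_i - h_i||^2
    return l1 + gamma * l2 + beta * lq
```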

III-C Implementation Details and Optimization

Since we train the network in mini-batch iterations, to capture the information of all the intra-class samples in the hashing code space, we dynamically estimate the binary centers $U=\{u_{i}\}_{i=1}^{Z}$ from the current binary sum and the newly sampled data. Recalling (6), we denote the current binary sum of all the intra-class samples belonging to the $j$th class as $\mathbf{sum}_{j}^{(t)}$. Then, for each iteration in which new samples of the $j$th class appear, $u_{j}$ is updated as follows:

\begin{split}\mathbf{sum}_{j}^{(t+1)}&=\mathbf{sum}_{j}^{(t)}+m_{j}^{(t+1)}\\ u_{j}^{(t+1)}&=\mathbf{sgn}(\mathbf{sum}_{j}^{(t+1)})\end{split}\qquad(13)

where $m_{j}^{(t+1)}=\sum_{k=1}^{n_{j}}\mathbf{sgn}(h_{jk})$ represents the sum of the binarized hashing outputs belonging to the $j$th class in the newly arrived mini-batch.
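
A possible realization of the running update in (13) is sketched below; `running_sum` plays the role of $\mathbf{sum}_{j}^{(t)}$ for all classes and, as in Algorithm 1, is reset to zero at the start of every epoch. Names are illustrative.

```python
import torch

def update_binary_centers(running_sum, h, labels):
    """Accumulate bit-wise sums per class and re-estimate binary centers, Eq. (13).

    running_sum: (C, l) accumulated sgn(h) per class, h: (B, l) mini-batch outputs,
    labels: (B,) integer class ids. Returns the updated sums and u in {-1, +1}.
    """
    binary = torch.sign(h).detach()                            # sgn(h_jk) for the new batch
    for c in labels.unique():
        running_sum[c] += binary[labels == c].sum(dim=0)       # sum_j^(t+1) = sum_j^(t) + m_j^(t+1)
    u = torch.where(running_sum >= 0,
                    torch.ones_like(running_sum),
                    -torch.ones_like(running_sum))             # u_j^(t+1) = sgn(sum_j^(t+1))
    return running_sum, u
```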

With each forward pass in the training procedure, we obtain the hashing outputs $\{h_{i}\}$ of the network and the estimated binary centers $U$. Then we take the derivatives of $L$ w.r.t. $h_{i}$ and $\mu_{j}$ individually and update $\mu_{j}$ by gradient descent, as shown below:

\frac{\partial L}{\partial h_{i}}=\frac{\partial L_{1}}{\partial h_{i}}-2\beta(b_{i}-h_{i});\quad\frac{\partial L}{\partial\mu_{j}}=\frac{\partial L_{1}}{\partial\mu_{j}}+\gamma\frac{\partial L_{2}}{\partial\mu_{j}}\qquad(14)

The gradient ${\partial L_{2}}/{\partial\mu_{j}}$ can be derived as:

\frac{\partial L_{2}}{\partial\mu_{j}}=\frac{1}{2}\sum_{i=1}^{Z}u_{i}(r_{ij}-s_{ij})\qquad(15)

where $r_{ij}=\rho(0.5u_{i}^{T}\mu_{j})$. Then, the gradients are backpropagated to the feature learning part to update the network parameters $\Theta$. The overall learning procedure of IDCWH is summarized in Algorithm 1.

Algorithm 1 Learning Algorithm for IDCWH

Input: Training data $X=\{x_{i}\}_{i=1}^{N}$, label matrix $Y\in\mathbb{R}^{C\times{N}}$.
Parameter: Hyperparameters $\sigma^{2}$, $\gamma$ and $\beta$; code length $l$ and batch size $bs$.
Output: Hashing function $f(\cdot)$, class centers $M=\{\mu_{j}\}_{j=1}^{C}$.
Initialize network parameters $\Theta$ by the Glorot distribution and class centers $M$ by a normal distribution; epoch $T=0$.

1:  Repeat
2:     $T=T+1$;
3:     $U=\mathbf{0}$
4:     for $\lfloor\frac{N}{bs}\rfloor$ iterations do
5:        Randomly sample a mini-batch of data $\{x_{i}\}_{i=1}^{bs}$;
6:        Compute $h_{i}=f(\Theta;x_{i})$ and $b_{i}=\mathbf{sgn}(h_{i})$ by forward propagation;
7:        Update $\{u_{k}\}_{k=1}^{Z}$ for each unique class label appearing in the batch by (13);
8:        Compute the loss function $L$ in (12);
9:        Calculate the derivatives of $L$ w.r.t. $h_{i}$ and $\mu_{j}$ individually by (14) and (15);
10:        Update $M$ and $\Theta$ by back propagation;
11:     end for
12:  Until Convergence
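
For concreteness, a condensed PyTorch training loop following Algorithm 1 is sketched below; `net` stands for the backbone plus the FC hashing layer, `centers` for a (C, l) nn.Parameter holding $M$, and the helper functions are the sketches given earlier. Learning rates, schedule and hyperparameters follow Sec. IV-A, but this is an illustrative outline rather than the released implementation.

```python
import torch

def train_idcwh(net, centers, loader, num_classes, code_len,
                epochs=150, sigma2=4.0, gamma=1.0, beta=0.01):
    """Condensed training loop following Algorithm 1 (illustrative sketch)."""
    device = next(net.parameters()).device
    opt = torch.optim.SGD([{"params": net.parameters(), "lr": 1e-2},
                           {"params": [centers], "lr": 5e-3}],
                          momentum=0.9, weight_decay=5e-4)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=50, gamma=0.1)
    for epoch in range(epochs):
        running_sum = torch.zeros(num_classes, code_len, device=device)   # U reset each epoch
        for x, labels in loader:
            x, labels = x.to(device), labels.to(device)
            h = net(x)                                          # h_i = f(Theta; x_i)
            running_sum, u_all = update_binary_centers(running_sum, h, labels)
            present = labels.unique()                           # unique classes in the batch
            u = u_all[present]                                  # estimated binary centers U, (Z, l)
            s = (present.unsqueeze(1) ==
                 torch.arange(num_classes, device=device)).float()        # Eq. (7)
            loss = total_loss(h, labels, centers, u, s, code_len,
                              sigma2, gamma, beta)
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()
    return net, centers
```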

III-D Out-of-Sample Extension

After completing the training procedure, the learned network can generate binary hashing codes for unseen samples. We obtain the binary code $b_{test,i}$ of a given test sample $x_{i}$ by simply making a forward pass through the network and applying the binarization operation, i.e.:

b_{test,i}=\mathbf{sgn}(f(\Theta;x_{i}))\qquad(16)
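
A minimal sketch of this out-of-sample encoding step (illustrative names, assuming the trained network `net` from above):

```python
import torch

@torch.no_grad()
def encode(net, x):
    """Eq. (16): binary codes for unseen samples, b = sgn(f(Theta; x))."""
    return torch.sign(net(x))
```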

IV Experiments

To verify the superiority and effectiveness of the proposed IDCWH method, we conduct extensive experiments on three large-scale benchmark datasets, i.e., CIFAR-10, CIFAR-100 [14] and MS-COCO [15]. All experiments are implemented in PyTorch and run on two Nvidia RTX 2080 Ti GPU cards.

IV-A Experiments Settings

CIFAR-10 [14] is a standard dataset for image recognition with 60,000 images in 10 classes. The official split contains 50,000 training images with 5,000 images per class, and 10,000 testing images with 1,000 per class. We conduct two groups of experiments following different protocols as used in [6, 11, 7]. In the first protocol, we randomly choose 100 images per class (1000 images in total) as query set, and 500 images per class (5000 images in total) as training set. The rest is all used as the database. In the second protocol, all the images in the official training split are used for training and all the images in the test split are used for query.

CIFAR-100 [14] is similar to CIFAR-10 except that it has 100 classes with 600 images per class. We follow the protocol as in DCWH [10], which adopts the official training set and testing set of CIFAR-100. Specifically, we use 100 images per class (10,000 in total) as a query set and 500 images per class (50,000 in total) for training.

MS-COCO [15] is a popular dataset for image recognition, object segmentation and captioning. The official release of ‘train2014’ and ‘val2014’ contains 82,783 training images and 40,504 validation images. Each image is labelled with some of the 80 semantic concepts. Following the protocol in [9], we combine the training and validation images and randomly sample 5,000 images as the query set and 10,000 images as the training set. The remaining images are all used as the database.

We adopt the standard similarity protocol for evaluation as in prior works. Specifically, for the single-label datasets, i.e., CIFAR-10 and CIFAR-100, images from the same class are regarded as similar, otherwise dissimilar. For the multi-label dataset, i.e., MS-COCO, any pair of images sharing at least one common label is regarded as similar, otherwise dissimilar. In addition to Mean Average Precision (MAP), four other commonly-used evaluation metrics are adopted in the experiments: Precision-Recall curves (PR), Precision within Hamming radius 2 w.r.t. different code lengths (P@H=2), Precision curves w.r.t. different numbers of top returned images (P@N) and Recall within Hamming radius 2 w.r.t. different code lengths (R@H=2).
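
For clarity, a simple sketch of MAP by Hamming ranking is given below, assuming single-label ground truth; for the multi-label MS-COCO protocol the relevance test would instead check whether two images share at least one label. Conventions (e.g., truncation at top-k) vary across works, so this is only one plausible variant.

```python
import torch

def mean_average_precision(db_codes, db_labels, q_codes, q_labels, topk=None):
    """MAP by Hamming ranking for single-label ground truth (illustrative sketch).

    db_codes/q_codes: {-1, +1} float tensors of shape (N, l) / (Q, l);
    db_labels/q_labels: integer class labels.
    """
    aps = []
    for q, ql in zip(q_codes, q_labels):
        dist = 0.5 * (db_codes.size(1) - db_codes @ q)     # Hamming distance to the query
        rel = (db_labels[dist.argsort()] == ql).float()    # relevance in ranked order
        if topk is not None:
            rel = rel[:topk]
        if rel.sum() == 0:
            continue
        prec = rel.cumsum(0) / torch.arange(1, rel.numel() + 1)   # precision at each rank
        aps.append((prec * rel).sum() / rel.sum())
    return torch.stack(aps).mean()
```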

We compare the proposed method with the original DCWH [10] and several classical deep supervised hashing methods, including DPSH [6], DTSH [11], DSDH [7] and HashNet [9]. Besides, we also compare with traditional supervised methods, i.e., SDH [4] and FSDH [5], combined with deep feature learning; the two methods are denoted as SDH+CNN and FSDH+CNN. For the feature learning part, we simply employ GoogLeNet [16] with batch normalization [17] as the backbone network, pretrained on ImageNet. Since some other works apply different network architectures from ours, for a fair comparison we re-run the experiments with the code released by the authors and report the performance of both the original works and their counterparts with GoogLeNet. The latter are marked with an asterisk, e.g., DPSH* and DSDH*.

The configuration of the proposed IDCWH is as follows: we finetune the pretrained GoogLeNet from the input to the last layer before the fully-connected layer and train the FC hashing layer (as shown in Fig. 2) for 150 epochs. We use mini-batch stochastic gradient descent (SGD) with momentum 0.9 and weight decay 5e-4. The initial learning rates are 1e-2 and 5e-3 for the feature learning part and the centers learning, respectively, and the learning rates are decayed by 0.1 every 50 epochs. The three hyperparameters, selected by cross-validation, are $\sigma^{2}=4$, $\gamma=1$ and $\beta=0.01$. The batch size is fixed to 128 in all the experiments.

TABLE I: MAP results by Hamming ranking on CIFAR-10 under two experiment protocols
Method CIFAR-10 (mini) CIFAR-10 (full)
12 bits 24 bits 32 bits 48 bits 12 bits 24 bits 32 bits 48 bits
SDH+CNN 0.207 0.218 0.223 0.210 0.364 0.433 0.405 0.414
FSDH+CNN 0.196 0.220 0.203 0.212 0.374 0.443 0.410 0.446
DPSH 0.713 0.727 0.744 0.757 0.763 0.781 0.795 0.807
DPSH* 0.797 0.806 0.820 0.802 0.908 0.922 0.925 0.935
DTSH 0.710 0.750 0.765 0.774 0.915 0.923 0.925 0.926
DTSH* 0.790 0.797 0.794 0.775 0.928 0.935 0.940 0.942
DSDH 0.740 0.786 0.801 0.820 0.935 0.940 0.939 0.939
DSDH* 0.800 0.802 0.804 0.808 0.913 0.925 0.943 0.930
DCWH 0.818 0.840 0.848 0.850 0.940 0.950 0.954 0.952
IDCWH 0.828 0.865 0.868 0.859 0.964 0.969 0.967 0.968
Refer to caption
Figure 3: Results on CIFAR-10 dataset. Left: P@H=2 w.r.t. different code lengths; Middle: P@N under 32-bit codes; Right: R@H=2 w.r.t. different code lengths.

IV-B Results

We first present the evaluation on the CIFAR-10 dataset. We denote the first and second experimental protocols as CIFAR-10 (mini) and CIFAR-10 (full), respectively. The MAP results of the compared methods are shown in Table I. We can see that the proposed IDCWH outperforms the other methods in almost all cases of the two protocols. Among them, the classwise-based deep hashing methods perform better than the other pairwise/triplet labels-based methods, and IDCWH surpasses DCWH by 1-2%. Specifically, under the first protocol, it achieves the best MAP performance of 86.8% with 32-bit codes, and the best MAP result under the second protocol is 96.9% with 24-bit codes. One can find that DPSH* and DTSH*, which employ the GoogLeNet backbone, perform significantly better than the original works with their own networks, while the performance of the methods combining traditional supervised hashing with deep feature learning, i.e., SDH+CNN and FSDH+CNN, is rather poor. This verifies the superiority of the end-to-end training manner of deep supervised hashing. One can also observe that, by training with more samples, the performance of all methods under the second protocol is greatly improved compared to the first protocol.

The performance on CIFAR-10 under other evaluation metrics is shown in Fig. 3. It is clear that the proposed IDCWH is robust, as it achieves the best result under various evaluation metrics. From Fig. 3, one can observe two common phenomena across all the compared methods. The first is that the P@H=2 curve always first increases and then drops as the code length grows. The second is that the overall trend of R@H=2 decreases as the code length increases. The reason may be that, as the dimension of the Hamming space grows, the learned binary codes become sparser, which leads to fewer data points falling within Hamming radius 2.

TABLE II: MAP results by Hamming ranking on CIFAR-100 dataset
Method CIFAR-100
12 bits 24 bits 32 bits 48 bits
SDH+CNN 0.0617 0.0624 0.0610 0.0668
FSDH+CNN 0.0596 0.0618 0.0650 0.0665
DPSH 0.0597 0.1008 0.1196 0.1587
DTSH 0.6070 0.7056 0.7122 0.7252
DSDH 0.0784 0.1495 0.1868 0.2272
DCWH 0.7227 0.7441 0.7570 0.7658
IDCWH 0.7642 0.8130 0.8236 0.8351

We further present the experimental results on the CIFAR-100 dataset. The MAP performance on CIFAR-100 is presented in Table II. We can see that the proposed IDCWH achieves the best results under all the compared code lengths and surpasses the second-best method, i.e., DCWH, by a margin of at least 4%. Specifically, it improves the MAP scores to more than 80% for code lengths of at least 24 bits, which is nearly 7% higher than DCWH in each case, while for 12-bit codes it outperforms the other methods with an MAP of 76.42%.

For the other pairwise/triplet labels-based methods, the performance on CIFAR-100 is much poorer than that on CIFAR-10. However, IDCWH and DCWH, both of which are based on the classwise loss, can still achieve promising results. Considering the larger number of classes and the smaller number of training images per class in CIFAR-100, this verifies the effectiveness of classwise hashing, which can better capture the semantic information of a large-scale dataset with small inter-class variations and significant intra-class variations. The performance on other evaluation metrics, i.e., P@H=2, P@N and PR curves, is illustrated in Fig. 4. We can see that the proposed IDCWH again performs the best under the various evaluation metrics. Observe that there is a drastic drop in the P@N curve at N=500, which corresponds to the fact that the number of training images per class in CIFAR-100 is 500. One can also find that, among all the pairwise/triplet labels-based methods, DTSH performs much better than DPSH and DSDH.

Finally, we summarize the evaluation on the multi-label MS-COCO dataset. The results of HashNet, DHN [18] and DNNH [19] are added, which are taken directly from [9], and we also run experiments with the DSDH and DPSH methods. The MAP results of all the compared methods on MS-COCO are presented in Table III. We can see that IDCWH performs slightly better than DCWH under each code length and achieves the highest MAP scores among all the methods. Recall the formulation of IDCWH: it treats each class center independently as a learnable parameter instead of updating the center manually. Thus, it can be feasibly extended to multi-label cases without computing the average of every new combination of class centers as in DCWH. One can also observe that the performance of DPSH and DSDH is poor on MS-COCO. This again validates our expectation that leveraging the classwise differences among all samples helps to generate more semantics-preserving binary codes.

Refer to caption
Figure 4: Results on CIFAR-100 dataset. Left: P@H=2 w.r.t. different code lengths; Middle: P@N with 48-bit codes; Right: PR curve with 48-bit codes.
TABLE III: MAP results by Hamming ranking on MS-COCO dataset
Method MS-COCO
16 bits 32 bits 48 bits 64 bits
DPSH 0.3493 0.3545 0.3595 0.3670
DSDH 0.3470 0.3587 0.3661 0.3703
DNNH 0.5932 0.6034 0.6045 0.6099
DHN 0.6774 0.7013 0.6948 0.6944
HashNet 0.6873 0.7184 0.7301 0.7362
DCWH 0.7227 0.7441 0.7570 0.7658
IDCWH 0.7321 0.7597 0.7636 0.7698

IV-C Empirical Analysis

IV-C1 Visualization of Hashing Codes

To intuitively compare the capability of different methods, we visualize the 48-bit hashing codes generated by DTSH, DCWH and IDCWH on the CIFAR-100 dataset. For a more comprehensive comparison, we also visualize the original 1024-dimensional features extracted from the deep CNN. We first randomly sample 10 of the 100 categories in CIFAR-100, then generate the binary codes of all the testing images belonging to these 10 categories. Finally, t-SNE [20] is used to map the 48-bit codes / 1024-dimensional features to 2-dimensional features for visualization. The results are shown in Fig. 5. We observe that the hashing codes generated by IDCWH are the most compact and discriminative, while the codes generated by DTSH and DCWH are not as well separated; this is also reflected in the ratio of within-class to between-class distances of each method.
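
A sketch of this visualization procedure is shown below; `net`, `loader` and `classes_to_show` are illustrative names, the model and loader tensors are assumed to be on the same device, and scikit-learn's TSNE with matplotlib stands in for the plotting pipeline.

```python
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

@torch.no_grad()
def visualize_codes(net, loader, classes_to_show):
    """t-SNE view of the learned binary codes, mirroring the Fig. 5 procedure (a sketch)."""
    keep_set = torch.tensor(classes_to_show)
    codes, labels = [], []
    for x, y in loader:
        mask = (y.unsqueeze(1) == keep_set).any(dim=1)     # keep only the sampled categories
        if mask.any():
            codes.append(torch.sign(net(x[mask])).cpu())
            labels.append(y[mask].cpu())
    codes = torch.cat(codes).numpy()
    labels = torch.cat(labels).numpy()
    emb = TSNE(n_components=2).fit_transform(codes)        # 48-bit codes -> 2-D embedding
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=4, cmap="tab10")
    plt.savefig("tsne_codes.png", dpi=150)
```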

Refer to caption
Figure 5: The t-SNE visualization of hashing codes learned from deep features, DTSH, DCWH and the proposed IDCWH, respectively. $d_{w}$ and $d_{b}$ represent the within-class and between-class distances.

IV-C2 Ablation Study

We conduct experiments to investigate the effect of the proposed centers similarity learning. Denote the final loss function without the $L_{2}$ term in (10) as IDCWH-Single. The performance comparison of the two configurations on the three datasets is shown in Table IV. We can see that IDCWH steadily performs better than IDCWH-Single in all cases. This validates the functionality of centers similarity learning, which helps each class center concentrate more on its intra-class samples while pushing other class centers as far away as possible. Another observation from Table IV is that the contribution of $L_{2}$ is more distinct for shorter code lengths than for relatively longer ones.

TABLE IV: Comparison on MAP with single classwise loss and the combination with centers similarity loss
Method CIFAR-10 (mini) CIFAR-100 MS-COCO
12-bit 24-bit 32-bit 48-bit 12-bit 24-bit 32-bit 48-bit 16-bit 32-bit 48-bit 64-bit
IDCWH-Single 0.7944 0.8239 0.8416 0.8420 0.7015 0.7393 0.7638 0.7701 0.6963 0.7125 0.7490 0.7529
IDCWH 0.8284 0.8650 0.8681 0.8586 0.7642 0.8130 0.8236 0.8351 0.7321 0.7597 0.7636 0.7698

IV-C3 Parameters Sensitivity Analysis

We further investigate the influence of the hyperparameters on IDCWH. We focus on two hyperparameters in our work: $\sigma^{2}$, which controls the variance of the intra-class samples around the corresponding class center, and $\beta$, which penalizes the quantization error. Specifically, we test $\sigma^{2}$ over $\{0.5,1,2,4,8\}$ and $\beta$ over $\{0.001,0.01,0.1\}$. The experiments are conducted on the CIFAR-100 dataset and the results are illustrated in Fig. 6. From Fig. 6, we observe that the performance under both $\sigma^{2}$ and $\beta$ shares a similar pattern as their values grow, i.e., it first rises and then descends. Thus, we choose $\sigma^{2}=4$ and $\beta=0.01$ in this paper as a promising configuration.

Refer to caption
Figure 6: Parameters sensitivity analysis. Left: MAP w.r.t. various $\sigma^{2}$ with 48-bit and 12-bit codes, respectively. Right: PR curve w.r.t. different values of $\beta$ with 48-bit codes.

V Conclusion

This paper addressed a limitation of prior works on deep supervised hashing and proposed an improved framework with centers similarity learning. To achieve this, it first clusters intra-class samples around the learnable class centers and dynamically estimates binary centers within each iteration. To enhance the discriminability of the hashing codes, it then minimizes the Hamming distance from the learnable class center to the corresponding estimated binary center while maximizing the distance to the other estimated centers. Experiments on three benchmark datasets verify that the proposed method effectively boosts the original deep classwise hashing and yields state-of-the-art retrieval results.

Acknowledgment

This work is supported by Hong Kong Research Grants Council (Project C1007-15G) and City University of Hong Kong (Projects 9610034 and 9610460).

References

  • [1] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Advances in neural information processing systems, 2009, pp. 1753–1760.
  • [2] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, “Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2916–2929, 2012.
  • [3] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang, “Supervised hashing with kernels,” in Proceedings of the IEEE conference on computer vision and pattern recognition.   IEEE, 2012, pp. 2074–2081.
  • [4] F. Shen, C. Shen, W. Liu, and H. Tao Shen, “Supervised discrete hashing,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 37–45.
  • [5] J. Gui, T. Liu, Z. Sun, D. Tao, and T. Tan, “Fast supervised discrete hashing,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 2, pp. 490–496, 2017.
  • [6] W.-J. Li, S. Wang, and W.-C. Kang, “Feature learning based deep supervised hashing with pairwise labels,” in Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence.   AAAI Press, 2016, pp. 1711–1717.
  • [7] Q. Li, Z. Sun, R. He, and T. Tan, “Deep supervised discrete hashing,” in Advances in neural information processing systems, 2017, pp. 2482–2491.
  • [8] Y. Chen, Z. Lai, Y. Ding, K. Lin, and W. K. Wong, “Deep supervised hashing with anchor graph,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9796–9804.
  • [9] Z. Cao, M. Long, J. Wang, and P. S. Yu, “Hashnet: Deep learning to hash by continuation,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 5608–5617.
  • [10] X. Zhe, S. Chen, and H. Yan, “Deep class-wise hashing: Semantics-preserving hashing via class-wise loss,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 5, pp. 1681–1695, 2020.
  • [11] X. Wang, Y. Shi, and K. M. Kitani, “Deep supervised hashing with triplet labels,” in Asian conference on computer vision.   Springer, 2016, pp. 70–84.
  • [12] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese, “Deep metric learning via lifted structured feature embedding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4004–4012.
  • [13] P. Zhang, W. Zhang, W.-J. Li, and M. Guo, “Supervised hashing with latent factor models,” in Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, 2014, pp. 173–182.
  • [14] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
  • [15] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision.   Springer, 2014, pp. 740–755.
  • [16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
  • [17] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
  • [18] H. Zhu, M. Long, J. Wang, and Y. Cao, “Deep hashing network for efficient similarity retrieval,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [19] H. Lai, Y. Pan, Y. Liu, and S. Yan, “Simultaneous feature learning and hash coding with deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3270–3278.
  • [20] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.