
Leaner and Faster: Two-Stage Model Compression for Lightweight Text-Image Retrieval

Siyu Ren           Kenny Q. Zhu
Shanghai Jiao Tong University
Shanghai, China
[email protected], [email protected]
Corresponding author: Kenny Q. Zhu.
Abstract

Current approaches to text-image retrieval (e.g., CLIP) typically adopt a dual-encoder architecture built on pre-trained vision-language representations. However, these models still have non-trivial memory requirements and substantial incremental indexing time, which makes them less practical on mobile devices. In this paper, we present an effective two-stage framework to compress a large pre-trained dual-encoder for lightweight text-image retrieval. The resulting model is smaller (39% of the original) and faster (1.6x/2.9x for processing images/text, respectively), yet performs on par with or better than the original full model on the Flickr30K and MSCOCO benchmarks. We also open-source an accompanying realistic mobile image search application.

1 Introduction

Text-image retrieval is the task of retrieving a list of relevant images from a large image pool given a textual query specified by the user. Recently, large-scale vision-language pre-training (VLP) has spawned models Tan and Bansal (2019); Li et al. (2020); Radford et al. (2021) that establish state-of-the-art results on various vision-language tasks Antol et al. (2015); Suhr et al. (2019), including text-image retrieval. Existing VLP models for text-image retrieval fall into two categories: cross-encoder architectures and dual-encoder architectures. Cross-encoder models achieve better retrieval accuracy by allowing fine-grained cross-modal attention between image and text. However, they are prohibitively slow to apply to the entire image pool because every image has to go through the deep Transformer again whenever a new text query comes in. Moreover, most cross-encoder models rely on external object detection models Ren et al. (2015) to extract visual features, which further increases memory consumption. Dual-encoder models, on the other hand, are more scalable in that they allow pre-computing image representations as reusable vectors independent of the text queries. These image vectors can be indexed and efficiently retrieved at runtime using Approximate Nearest Neighbor (ANN) search Johnson et al. (2017). As long as the image pool remains unchanged, the image encoder is not required at query time.

Figure 1: In stage-1 (Section 3.2.1), we perform intra-modal contrastive knowledge distillation. In stage-2 (Section 3.2.2), we sequentially fine-tune $f_v^S$ and $f_t^S$ with knowledge distillation (KD) and corpus-level hard negative (HN) mining via a pre-computed index. The total loss $\mathcal{L}$ is the sum of $\mathcal{L}_{t2v}$, $\mathcal{L}_{v2t}$, $\mathcal{L}_{KD}$, and $\mathcal{L}_{HN}$. The thin black arrows represent input/output flows and the solid green arrows indicate gradient flows.

However, a more practical scenario calls for dynamic indexing of new images into the pool (e.g., private photo collections on mobile devices), which requires both the image encoder and the text encoder to be resident in memory. This makes the above approach less practical on mobile devices with limited memory and processing power. Unfortunately, little attention has been paid to fulfilling this need. In this paper, we show that a large dual-encoder model can be compressed into a much smaller and faster counterpart while retaining its retrieval accuracy, using a novel two-stage compression framework. In the first stage, we make use of abundant non-paired texts/images to separately compress the text or image encoder with an effective intra-modal contrastive knowledge distillation scheme. In the second stage, we sequentially fine-tune the distilled image or text encoder on paired text-image data with comprehensive learning objectives. Using CLIP Radford et al. (2021) as the target model, our compressed models deliver comparable performance on MSCOCO and Flickr30K while being just 39% of the original size and 1.6x/2.9x faster for processing images/text. A detailed ablation study shows the effectiveness of each component in the compression framework and their synergistic effects.

Our contributions are threefold: 1) an effective compression framework tailored for lightweight text-image retrieval; 2) a leaner and faster model with competitive accuracy; 3) open-sourced models and text-to-image search mobile applications for both iOS and Android at https://github.com/DRSY/MoTIS.

2 Related Work

Cross-encoder. The cross-encoder architecture Tan and Bansal (2019); Chen et al. (2019); Li et al. (2020) adopts a single Transformer network Vaswani et al. (2017) that can process inputs from different modalities, e.g., images and texts. Benefiting from the self-attention mechanism, the hidden states of images and texts interact with each other at the patch/token level, yielding state-of-the-art retrieval accuracy. Though effective, these models suffer from huge memory consumption and inference latency, making them impractical in time-sensitive real-world scenarios.

Dual-encoder. In contrast to the cross-encoder, the dual-encoder architecture Radford et al. (2021); Jia et al. (2021) trains two separate encoders for the vision and language modalities. The exact choice of encoder architecture may differ. For example, CLIP uses Transformers for both the visual and text encoders, while ALIGN Jia et al. (2021) uses a pre-trained BERT as the text encoder and EfficientNet as the visual encoder. In dual-encoder models, interactions between modalities take place only at the final encoder layer, resulting in slightly worse performance compared to cross-encoders. Nevertheless, this late-interaction scheme allows for efficient similarity computation, rendering dual-encoders suitable for real-time search.

3 Approach

3.1 Background on Dual-Encoder

Dual-encoder architecture employs two separate neural networks to encode inputs from different modalities and map them to a shared space.

We denote the image encoder as $f_v$ and the text encoder as $f_t$ in the context of text-image retrieval. To train $f_v$ and $f_t$, it is common to adopt an objective that pushes the embeddings of matched text-image pairs closer while pushing those of non-matched text-image pairs apart. Specifically, Contrastive Language-Image Pre-training (CLIP) Radford et al. (2021) optimizes an InfoNCE van den Oord et al. (2018) loss:

$$\mathcal{L}_{t2v}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{f_{t}(x_{i})^{\top}f_{v}(y_{i})/\tau}}{\sum_{j=1}^{N}e^{f_{t}(x_{i})^{\top}f_{v}(y_{j})/\tau}} \qquad (1)$$

Here, $f_t(x_i)$ and $f_v(y_j)$ are the L2-normalized embeddings of the text in the $i$-th pair and the image in the $j$-th pair, $N$ is the mini-batch size, and $\tau$ is the temperature used to scale the logits. The final objective is the sum of $\mathcal{L}_{t2v}$ and its symmetric counterpart $\mathcal{L}_{v2t}$.
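For concreteness, a minimal PyTorch sketch of this symmetric InfoNCE objective (variable names are ours; the actual CLIP training code may differ):

```python
import torch
import torch.nn.functional as F

def clip_infonce(text_emb: torch.Tensor, image_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a mini-batch of N matched text-image pairs.

    text_emb, image_emb: (N, d) L2-normalized embeddings from f_t and f_v.
    tau: temperature used to scale the logits.
    """
    logits = text_emb @ image_emb.t() / tau           # (N, N) similarity matrix
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    loss_t2v = F.cross_entropy(logits, targets)       # text -> image direction (Eq. 1)
    loss_v2t = F.cross_entropy(logits.t(), targets)   # image -> text direction
    return loss_t2v + loss_v2t
```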

3.2 Two-Stage Model Compression

Despite good retrieval accuracy, models like CLIP still have a non-trivial memory footprint and inference time, which is undesirable for low-resource devices such as smartphones.

To tackle this issue, we propose a two-stage compression framework that makes a large dual-encoder model smaller and faster while retaining its accuracy. A schematic overview is illustrated in Figure 1. The first stage is task-agnostic: we leverage massively available non-paired texts/images to separately compress the text/image encoder using an intra-modal contrastive knowledge distillation scheme. The second stage is task-specific: we sequentially fine-tune the distilled image and text encoders using a combination of multiple techniques. We denote the image and text encoders of the large dual-encoder as $f_v^T$ and $f_t^T$ and those of the compressed model as $f_v^S$ and $f_t^S$.

3.2.1 Stage-1

The extremely large scale of text-image pairs (e.g., the 400 million used to train CLIP) makes it possible to compensate for noise in the data and train an over-parameterized large dual-encoder (i.e., $f_v^T$ and $f_t^T$) from scratch to learn aligned visual and language representations. However, it is difficult to train a small model (i.e., $f_v^S$ and $f_t^S$) with lower capacity using the same inter-modal learning scheme.

To circumvent this issue, we propose to exploit massively available non-paired data from the web and optimize an intra-modal contrastive objective that aligns the output embeddings of $f^S$ and the pretrained $f^T$, which can be seen as a form of knowledge distillation Hinton et al. (2015). Here we take the visual modality as an example. Given a collection of images $\{y_i\}_{i=1}^{N}$, we feed them to both $f_v^S$ and $f_v^T$ to produce two sets of image embeddings $\{f_v^S(y_i)\}_{i=1}^{N}$ and $\{f_v^T(y_i)\}_{i=1}^{N}$. We then optimize the following contrastive objective to update $f_v^S$:

$$\mathcal{L}_{v2v}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{f_{v}^{S}(y_{i})^{\top}f_{v}^{T}(y_{i})/\tau}}{\sum_{j=1}^{N}e^{f_{v}^{S}(y_{i})^{\top}f_{v}^{T}(y_{j})/\tau}} \qquad (2)$$

The same formulation is symmetrically applied to the language modality to obtain $\mathcal{L}_{t2t}$ for updating $f_t^S$:

$$\mathcal{L}_{t2t}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{f_{t}^{S}(x_{i})^{\top}f_{t}^{T}(x_{i})/\tau}}{\sum_{j=1}^{N}e^{f_{t}^{S}(x_{i})^{\top}f_{t}^{T}(x_{j})/\tau}} \qquad (3)$$

Essentially, $f_v^S$/$f_t^S$ is trained to recover the representation power of $f_v^T$/$f_t^T$ in a decoupled manner.
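A minimal sketch of this stage-1 step for the visual modality, assuming both encoders output L2-normalized embeddings of the same dimension (if the student's embedding dimension differs, an extra projection layer would be needed; that detail is not specified here):

```python
import torch
import torch.nn.functional as F

def intra_modal_distill(student_emb: torch.Tensor, teacher_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """L_v2v / L_t2t: contrastively align student outputs with the frozen teacher.

    student_emb: (N, d) embeddings from f^S (trainable).
    teacher_emb: (N, d) embeddings from f^T (treated as constants).
    """
    logits = student_emb @ teacher_emb.detach().t() / tau   # (N, N) student-teacher similarities
    targets = torch.arange(student_emb.size(0), device=student_emb.device)
    # The i-th student embedding should be closest to the i-th teacher embedding (Eqs. 2-3).
    return F.cross_entropy(logits, targets)
```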

3.2.2 Stage-2

After training $f_v^S$ and $f_t^S$ on general-domain data, it is necessary to adapt the learned representations to downstream tasks using in-domain data. First, we fine-tune $f_v^T$ and $f_t^T$ on paired text-image data $D=\{(x_i, y_i)\}_{i=1}^{N}$ using the standard InfoNCE loss (Section 3.1). In our experiments, we found that jointly fine-tuning the image and text encoders results in retrieval performance even worse than no fine-tuning at all. Therefore, we fine-tune $f_v^T$/$f_t^T$ sequentially, fixing one while tuning the other. The resulting fine-tuned encoders are denoted $f_v^{T'}$ and $f_t^{T'}$ and are henceforth kept fixed. Next, for training $f_v^S$ and $f_t^S$, we propose several techniques essential to successful compression:

Knowledge Distillation (KD). In addition to the standard InfoNCE loss, we design two knowledge distillation objectives to learn from $f_v^{T'}$ and $f_t^{T'}$. One is the Kullback-Leibler divergence between the image-text matching distribution predicted by $f_v^{T'}$ and $f_t^{T'}$ and the one predicted by $f_v^S$ and $f_t^S$; this resembles previous response-based knowledge distillation Hinton et al. (2015). The other is the same contrastive objective defined in Section 3.2.1, which indirectly encourages the alignment between visual and language representations.
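A hedged sketch of the response-based term, computed over in-batch image-text matching distributions (the exact KL direction and temperature handling used in the paper may differ):

```python
import torch
import torch.nn.functional as F

def response_kd(s_text: torch.Tensor, s_img: torch.Tensor,
                t_text: torch.Tensor, t_img: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """KL divergence between teacher and student text-to-image matching distributions.

    s_*: (N, d) student embeddings; t_*: (N, d) fine-tuned teacher embeddings (frozen).
    """
    s_logits = s_text @ s_img.t() / tau               # student text -> image scores
    t_logits = (t_text @ t_img.t() / tau).detach()    # teacher scores, no gradient
    # F.kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1),
                    reduction="batchmean")
```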

Sequential Fine-tuning (SF). Similar to how we obtain $f_v^{T'}$ and $f_t^{T'}$, we also fine-tune $f_v^S$ and $f_t^S$ sequentially. Concretely, we first let the compressed model share the text encoder of the target dual-encoder and fine-tune only its image encoder. We then fix the image encoder and fine-tune its text encoder in the same way.

Hard Negative Mining (HN). Prior work on contrastive representation learning Chen et al. (2020); Gao et al. (2021) typically exploits in-batch negative samples. Though efficient, image-text pairs in a batch are randomly sampled and are likely to be trivially unrelated; models trained this way may fail when candidates are similar. To achieve more accurate retrieval, we mine hard negatives from the entire corpus. In our sequential fine-tuning setting, we first use $f_t^{T'}$ to compute embeddings of all texts in the corpus and index them with Faiss Johnson et al. (2017). During training of $f_v^S$, for each image $y_i$ we use $f_v^S(y_i)$ as a query to the index and take its top-k texts as negative samples. Afterward, we use the trained $f_v^S$ to compute embeddings of all images in the corpus and build the index. During training of $f_t^S$, for each text $x_i$ we use $f_t^S(x_i)$ as a query to the index and take its top-k images as negative samples.
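A minimal sketch of the corpus-level mining step with Faiss, following the description above; the flat inner-product index, the exclusion of the ground-truth caption, and the value of k are illustrative assumptions rather than details from the paper:

```python
import faiss
import numpy as np

def build_text_index(text_embs: np.ndarray) -> faiss.Index:
    """Index L2-normalized text embeddings so inner product equals cosine similarity."""
    index = faiss.IndexFlatIP(text_embs.shape[1])
    index.add(text_embs.astype(np.float32))
    return index

def mine_hard_negatives(index: faiss.Index, image_embs: np.ndarray, gold_ids, k: int = 16):
    """For each image embedding, return the ids of its top-k most similar texts,
    excluding the ground-truth caption, to serve as hard negatives."""
    _, ids = index.search(image_embs.astype(np.float32), k + 1)
    return [[int(j) for j in row if j != gold][:k] for row, gold in zip(ids, gold_ids)]
```

The symmetric direction (mining image negatives for each text) would reuse the same two functions with the roles of images and texts swapped.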

The complete training objective of stage-2 is $\mathcal{L}=\mathcal{L}_{t2v}+\mathcal{L}_{v2t}+\mathcal{L}_{KD}+\mathcal{L}_{HN}$.

4 Experiment

Image        Text               MSCOCO (1K)          MSCOCO (5K)          Flickr30K
                                R@1   R@5   R@10     R@1   R@5   R@10     R@1   R@5   R@10
$f_v^T$      $f_t^T$            46.9  77.3  87.3     28.0  52.9  64.5     55.2  80.3  87.8
$f_v^{T'}$   $f_t^{T'}$         61.0  87.9  94.7     40.9  67.6  77.9     58.0  82.3  89.1
$f_v^S$      $f_t^{TinyBERT}$   41.4  76.7  88.1     21.3  47.2  61.0     30.2  59.1  71.2
$f_v^S$      $f_t^{S_4}$        62.0  88.0  94.4     42.0  69.2  79.0     55.0  81.3  88.4
$f_v^S$      $f_t^{S_6}$        62.7  88.2  94.5     42.6  69.6  79.4     57.0  82.1  88.8
Table 1: Comparison of text-image retrieval results on MSCOCO (1K and 5K) and Flickr30K.
Image                  MSCOCO (5K)           Δ R@1
                       R@1   R@5   R@10
$f_v^S$                36.7  64.6  75.3      -
  w/o stage-1          32.6  59.6  70.7      -4.1
  stage-1 (MSE)        22.6  46.7  58.5      -14.1
  stage-1 (InfoNCE)    31.7  58.5  69.6      -5.0
  w/o SF               30.9  57.6  70.8      -5.8
  w/o KD               35.8  63.1  74.2      -0.9
  w/o HN               34.4  62.0  73.7      -2.3
  w/o KD+HN            32.6  60.3  71.9      -4.1
Table 2: Ablation on design choices in both stages.
Image        Text          Disk Space (MB)   QPS_v    QPS_t
$f_v^{T'}$   $f_t^{T'}$    578               1.00x    1.00x
$f_v^S$      $f_t^{S_6}$   255               1.51x    1.98x
$f_v^S$      $f_t^{S_4}$   230               1.51x    2.77x
Table 3: Comparison of disk space and QPS.

4.1 Setup

Dataset. We use Conceptual Captions Sharma et al. (2018) for stage-1 compression. It consists of 3M noisy image alt-text pairs. However, we do not use the image-text alignment information; we only treat it as a reservoir of general-domain images and texts. In stage-2, we use MSCOCO Lin et al. (2014) and Flickr30K Plummer et al. (2015) as the benchmarks. MSCOCO has 113,287 training images, 5,000 validation images, and both 5K and 1K test splits. Flickr30K has 28,783 training images, 1,000 validation images, and 1,000 test images.

Evaluation Metrics. Following previous work, we use recall R@K (K=1, 5, 10) as the main metric of task performance. We also report disk space (MB) and how many image/text queries can be encoded per second (QPS_v for images and QPS_t for texts) to evaluate the model's memory footprint and inference speed.
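For reference, a sketch of how R@K can be computed for text-to-image retrieval from a score matrix, assuming one ground-truth image index per text query (this is our own illustration, not the paper's evaluation code):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, gt: np.ndarray, ks=(1, 5, 10)) -> dict:
    """sim: (num_queries, num_images) text-to-image similarity scores.
    gt:  (num_queries,) index of the ground-truth image for each query."""
    ranking = np.argsort(-sim, axis=1)                      # best-scoring image first
    ranks = np.array([np.where(ranking[i] == gt[i])[0][0]   # rank position of the gold image
                      for i in range(sim.shape[0])])
    return {f"R@{k}": float((ranks < k).mean()) for k in ks}
```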

Target Model. We use the open-sourced ViT-B/32 CLIP as the target dual-encoder model to compress. The image encoder $f_v^T$ is a 12-layer Vision Transformer Dosovitskiy et al. (2020) with 768 hidden dimensions and 12 attention heads. The text encoder $f_t^T$ is a 12-layer Transformer with 512 hidden dimensions and 8 attention heads. Note that this is the largest publicly available version according to OpenAI's official repository.

Compression Configuration. For the image encoder $f_v^S$, we use a ViT-S/16 with 384 hidden dimensions. We initialize it with weights pretrained on ImageNet-21K Ridnik et al. (2021) for faster convergence and better performance. For the text encoder $f_t^S$, we experiment with both 6-layer and 4-layer Transformers (denoted $f_t^{S_6}$ and $f_t^{S_4}$), whose weights are initialized from the corresponding layers of $f_t^T$. We also compare with a baseline compression method that directly fine-tunes a pre-trained ViT-S/16 and a 4-layer TinyBERT Jiao et al. (2019) ($f_t^{\text{TinyBERT}}$) using the InfoNCE loss throughout both stages.
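A hedged sketch of how a shallower student text encoder could be initialized from the teacher's Transformer blocks. The layer-selection rule shown here (evenly spaced layers) and the module structure are our assumptions; the paper only states that weights come from corresponding teacher layers:

```python
import copy
import torch.nn as nn

def init_student_from_teacher(teacher_layers: nn.ModuleList, num_student_layers: int) -> nn.ModuleList:
    """Deep-copy evenly spaced teacher Transformer blocks into a smaller student stack,
    e.g., 12 teacher layers -> 6 student layers by taking every 2nd block."""
    step = len(teacher_layers) // num_student_layers
    return nn.ModuleList([copy.deepcopy(teacher_layers[i * step])
                          for i in range(num_student_layers)])
```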

Implementation Details. In stage-1, we train for 1 epoch using AdamW Loshchilov and Hutter (2017) with a batch size of 84 for both images and texts, a learning rate of 3e-4, and a weight decay of 0.1. In stage-2, we use the same optimization setting except that we train with batch size 96 for 5 epochs. We employ a cosine learning rate scheduler with 10,000 warm-up steps in both stages. All reported results are computed on the test set using the checkpoints with the highest validation performance.
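One way to reproduce this optimization setup, sketched with PyTorch and the transformers library's cosine-schedule helper (the helper choice and the total-step argument are our assumptions, not necessarily the authors' implementation):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

def make_optimizer_and_scheduler(model, total_steps: int, lr: float = 3e-4,
                                 weight_decay: float = 0.1, warmup_steps: int = 10_000):
    """AdamW with weight decay 0.1 and a cosine LR schedule with 10k warm-up steps."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
    return optimizer, scheduler
```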

4.2 Results

Main Results. Table 1 summarizes the main results. The pre-trained CLIP model can already deliver moderately good retrieval performance, which is further improved by fine-tuning. Fine-tuning the pre-trained ViT-S/16 and TinyBERT underperforms zero-shot CLIP, showing that training with inter-modal InfoNCE is not effective without extremely large-scale paired data. On most evaluation metrics, models compressed by our two-stage pipeline perform on par with or better than the fine-tuned target model. We also find that the capacity of the text encoder has limited effect on performance.

Ablation Study. We perform extensive ablations to study the importance of each proposed technique. Due to the computational budget, we conduct ablations only on the image encoder and fix the text encoder as $f_t^{T'}$. For stage-1, we evaluate w/o stage-1, stage-1 (MSE) (mean squared error between $f_v^S$ and $f_v^T$), and stage-1 (InfoNCE) (identical to the loss in Section 3.1). We also study the effectiveness of KD/SF/HN by removing them separately or together. We make several observations based on Table 2: 1) SF makes fine-tuning stable and is essential for convergence; 2) both KD and HN improve retrieval accuracy and are complementary to each other; 3) intra-modal contrastive distillation helps when image-text pairs are noisy and outperforms the inter-modal InfoNCE loss.

Efficiency. In Table 3, we compare the disk space and QPS of the models on an RTX 2080Ti with 12GB memory. The compressed image encoder $f_v^S$ takes 85MB of disk space (39% of $f_v^T$) while being 1.51x faster. Our compressed text encoder achieves up to 2.77x inference speed-up and a 40% size reduction (from 243MB to 146MB). We further benchmark the models' memory and run-time performance on a real iPhone X with 1,000 images in the gallery. Loading CLIP into main memory takes 870MB, versus 295MB for our compressed model. After indexing, the response time for a single text query is 0.4s for CLIP and only 0.1s for our compressed model. Although the results are hardware-dependent, our compressed model still shows an evident improvement in efficiency.

5 Conclusion

In this paper, we present a two-stage compression framework for lightweight text-image retrieval. Experiments on two benchmarks show the effectiveness of each component in the framework, with the best performance achieved when they are combined. The framework reduces model size and accelerates inference, making memory- and latency-sensitive applications more practical.

Acknowledgement

This research is partially supported by NSFC Grant No. 91646205, and SJTU-CMBCC Joint Research Scheme.

References