
Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with
Free ATtention Masks

David Junhao Zhang1,  Mutian Xu2,4,
Chuhui Xue3, Wenqing Zhang3, Xiaoguang Han2,4, Song Bai3, Mike Zheng Shou1

1 Show Lab, National University of Singapore  2 SSE, CUHKSZ
3 ByteDance  4 FNii, CUHKSZ
Work was partially done during an internship at ByteDance. Corresponding author.
Abstract

Despite the rapid advancement of unsupervised learning in visual representation, it requires training on large-scale datasets that demand costly data collection and pose additional challenges due to concerns regarding data privacy. Recently, synthetic images generated by text-to-image diffusion models have shown great potential for benefiting image recognition. Although promising, there has been inadequate exploration dedicated to unsupervised learning on diffusion-generated images. To address this, we start by uncovering that diffusion models’ cross-attention layers inherently provide annotation-free attention masks aligned with corresponding text inputs on generated images. We then investigate the problems of three prevalent unsupervised learning techniques (i.e., contrastive learning, masked modeling, and vision-language pretraining) and introduce customized solutions by fully exploiting the aforementioned free attention masks. Our approach is validated through extensive experiments that show consistent improvements of baseline models across various downstream tasks, including image classification, detection, segmentation, and image-text retrieval. By utilizing our method, it is possible to close the performance gap between unsupervised pretraining on synthetic data and real-world scenarios.

1 Introduction

Unsupervised learning is a type of machine learning where models learn to identify patterns or structures in data without explicit labels. In the past few years, several unsupervised learning techniques have emerged, including contrastive learning [1, 2, 3, 4], masked modeling [5, 6], and vision-language pretraining [7, 8, 9]. Although these advancements have led to significant progress in visual representation learning, the majority of them rely on pretraining on large-scale datasets such as ImageNet [10], which contains millions of images. However, manually building a sizable dataset with decent richness and diversity is often time-consuming and costly. Moreover, present-day concerns about data privacy and usage rights further complicate the acquisition of massive data [11], creating additional obstacles to the development of unsupervised learning.

To overcome these challenges, using synthetic data for unsupervised pretraining presents itself as a logical option, given its advantageous characteristics such as cost-effectiveness, virtually limitless scalability, enhanced control over data distribution, and improved data privacy and security.

In the computer vision area, there have been several attempts to leverage synthetic data for image recognition tasks. Besnier et al. [12] and Zhao et al. [13] both employ BigGAN [14] to produce informative images for training image classifiers. DatasetGAN [15] and BigDatasetGAN [16] adopt StyleGAN [17] and BigGAN [14] to generate images with pixel-wise labels for segmentation tasks. Beyond GANs, He et al. [18] find that revolutionary text-to-image diffusion models such as GLIDE [19] can generate not only high-quality but also diverse images in a customized label space for benefiting image recognition. This recent study is noteworthy as the first to demonstrate promising results for image understanding using diffusion-generated images. Despite these promising results, there has been a lack of in-depth exploration focusing on unsupervised learning on diffusion-generated data. We attempt to fill this gap from the perspective of both diffusion-generated data and unsupervised models.

i) In contrast to class-specific GANs that can only generate images of individual objects, text-to-image diffusion models have the capability to produce diverse images featuring multiple objects by utilizing different text tokens. More importantly, we find that the cross-attention layers of diffusion models naturally provide semantic attention masks aligned with corresponding text inputs on generated images without any manual annotations (also indicated in [20, 21]), which helps to locate each foreground object as shown in Fig. 1. ii) Next, from the perspective of unsupervised models, we examine the shortcomings of three commonly used unsupervised learning schemes and investigate how the attention masks of diffusion-generated images can be used as a free resource to enhance unsupervised learning on synthetic data.

Figure 1: Visualization of diffusion-generated images with freely available attention masks.

Contrastive Learning (CL) aims to learn representations that can distinguish between positive pairs (similar examples) and negative pairs (dissimilar examples). Conventional techniques [1, 2, 3, 4, 22] treat an image with a single object as a complete entity and conduct random crops and augmentations to obtain positive pairs at the image level. Yet, such a paradigm is not suitable for diffusion-generated images containing multiple objects. As discussed earlier, the diversity of diffusion-generated images, which often include multiple instances, presents a significant challenge for traditional random crops in contrastive learning: there is a high risk of positive pairs originating from distinct instances, which introduces ambiguity into model training as the discriminative features of different instances are pulled together (see top row of Fig. 2 (a)). To mitigate this issue, we leverage the free attention masks from diffusion generators to ensure that each positive pair comes from the same object. Meanwhile, negative pairs are formed by selecting different instances in the images based on their corresponding masks (see bottom row of Fig. 2 (a)). Our approach helps the CL model acquire more precise instance-level information.

Masked Modeling (MM) has recently achieved widespread success in various vision challenges by reconstructing masked visual patches [5, 6, 23]. A recent study [24] proposes to identify the patches that are hard to reconstruct and shows that such patches are consistently located within foreground objects. By restoring these particular patches rather than randomly-masked ones, the network can acquire more focused features. Inspired by this, as indicated in Fig. 2 (b), we employ the freely available attention map from diffusion generators and present a balanced masking technique for masked modeling that gradually increases the masking ratio of foreground object patches. Our strategy enables the MM model to learn both universal and targeted representations.

Vision-and-Language Pretraining (VLP) is developed to jointly pretrain visual and language features using image-text matching by learning to align and translate between the two modalities [7, 8, 9]. VLP models predominantly rely on position features, such as those belonging to the objects of interest in an image, to gain a better understanding of the relationships between words and objects. Nonetheless, locating these features through the use of bounding boxes predicted by object detection algorithms can be a time-consuming process. Additionally, the quality of the position features is heavily dependent on the performance of the object detectors, which may ultimately limit the strength of VLP models. Fortunately, the attention maps from text-to-image diffusion models naturally align each text prompt with its corresponding object position. As shown in Fig. 2 (c), we apply attention masks to supply position information without requiring the extra step of object detection. This not only brings greater efficiency to VLP models, but also enhances their overall effectiveness.

Our proposed frameworks, which rely on annotation-Free ATtention Masks from diffusion generators, are collectively referred to as Free-ATM. Extensive experiments show that our method of conducting unsupervised pretraining on diffusion-generated data consistently improves baseline performance on large-scale real-world benchmarks, including PASCAL VOC [25], COCO [26], Cityscapes [27], and ADE20K [28], across various downstream tasks such as image classification, detection, segmentation, and image-text retrieval. Besides, our Free-ATM achieves significantly better results than simply applying the original unsupervised pretraining protocols to diffusion-generated data. By leveraging Free-ATM, the performance gap between unsupervised pretraining on synthetic data and real-world scenarios can possibly be closed. Moreover, mixing synthetic data with real-world data for pretraining can further boost performance.

Figure 2: Existing problems in current unsupervised learning frameworks and our solutions.

2 Related Work

Text-to-Image Diffusion Models. Diffusion models [29, 30, 31, 32, 33] have been making waves in the generative modeling landscape, primarily due to their unique approach to generating new data samples. These models start from simple random noise and progressively denoise it through numerous steps of learned transformations until it mirrors a sample from the desired data distribution. Recently, large-scale text-to-image diffusion models such as Stable Diffusion [34], Imagen [35], and GLIDE [19] have made considerable strides and produced striking visual results. Rombach et al. [34] propose an approach in which the diffusion process takes place in the latent space, utilizing a UNet [36] to predict noise and a VAE [37] decoder to convert the latent feature into pixel space. This design streamlines and accelerates the text-to-image diffusion process, making it a popular choice for diffusion models. Building on the latent diffusion model, Hertz et al. [20] discover that the cross-attention map, which is employed for text-visual interaction in the UNet, can accurately localize the foreground object given an appropriate prompt, and they leverage this characteristic to manipulate images as required by directly modifying the attention map. Similarly, Zhao et al. [21] use the UNet of the diffusion model as the core structure and extract the cross-attention map to boost zero-shot segmentation performance.

In our study, we use the latent diffusion model to generate images in line with the text prompts and to extract the cross-attention map. We then investigate how these attention maps, obtained without any human annotation, can serve as an additional free resource to enhance unsupervised learning on synthetic images.

Synthetic Data from Generative Models. Generative Adversarial Networks (GANs) [38] can produce highly realistic, high-quality images, and numerous studies utilize GANs to synthesize datasets akin to ImageNet [10]. Li et al. [16] present BigDatasetGAN, a method that generates a vast number of images corresponding to ImageNet classes with pixel-level labels, but it necessitates additional human annotation. Jahanian et al. [39] propose a method that employs a GAN to generate multiple views of an image, which are then used as positive pairs for contrastive learning. Recently, diffusion models have been gaining precedence in the field of image generation. He et al. [18] demonstrate that text-to-image diffusion models, such as GLIDE [19], can produce data in a custom label space for image recognition. Furthermore, Azizi et al. [40] and Sariyildiz et al. [41] employ Imagen [35] and Stable Diffusion [34] to generate images with class labels, thereby improving supervised classification performance. Adding to this, Brandon et al. utilize diffusion model inversion to augment images for classifying small datasets.

Rather than focusing on task-specific synthetic data manipulation, we delve into the broader realm of unsupervised learning on synthetic images. Notably, we leverage the overlooked attention masks, which can offer pixel-level labels without any need for human annotations, thus adding a new dimension to unsupervised learning.

Unsupervised Learning. (i) Contrastive Learning is built upon the foundational idea of drawing positive pairs nearer while distancing negative pairs in the representation space. This approach has proven effective for learning visual representations without labeled data [42, 43, 44, 45, 46, 47, 48, 49]. A significant advancement in this area is the SimCLR framework [1], which substantially improves the quality of the learned representations via a non-linear transformation head. MoCo [4], another impactful work, maintains a memory bank for a vast array of negative samples and employs a momentum-based method for gentle updates, ensuring improved consistency during learning. (ii) Masked Modeling methods [5, 50, 51, 52, 23], such as MAE [5], SimMIM [6], and iBOT [50], use masked patch reconstruction in combination with basic data augmentation to effectively learn robust representations. (iii) Vision-Language Pre-training (VLP), exemplified by models like CLIP [7], BLIP [8], and ViLT [9], seeks to boost performance on downstream vision and language tasks by pretraining on large-scale image-text pairs to align visual and textual features without any task-specific supervision.

In this work, we propose tailored approaches that leverage the freely available attention masks from diffusion generators to enhance the above-mentioned three unsupervised learning frameworks.

3 Method

3.1 Attention Mask from Text-to-Image Diffusion Model

The Latent Text-to-Image Diffusion model [34] is a generative model that uses a text prompt to create high-quality images through a controlled diffusion process. It first encodes the text prompt into a latent space representation, then uses a diffusion process to gradually transform a noise input into the final image, guided by the encoded text. The model captures the joint understanding of textual and visual data via cross-attention interaction in the UNet. Specifically, at a single diffusion step and layer, the text embedding is projected into keys $K\in\mathbb{R}^{L\times C}$ and the visual noise is projected into queries $Q\in\mathbb{R}^{H\times W\times C}$, where $L$ is the text sequence length, $H, W$ are the height and width of the visual feature, and $C$ is the feature dimension. The cross-attention map is obtained by multiplying $Q$ and $K$:

A=\text{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right), \quad (1)

where $A\in\mathbb{R}^{H\times W\times L}$ is the attention map, illustrating the relationship between textual and visual elements. The attention map $a\in\mathbb{R}^{H\times W}$ corresponding to a specific noun, such as ‘dog’, in the $L$-length sentence is selected. Following this, we compile the attention maps derived from every layer and time step within the diffusion model. These maps are then resized and averaged to form a single map. Visualizations are shown in Fig. 1.
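To make the aggregation concrete, below is a minimal PyTorch sketch of averaging the per-noun cross-attention maps over layers and timesteps; the hook that collects `attn_maps` and the assumed map shapes are illustrative rather than Stable Diffusion's actual interface.

```python
import torch
import torch.nn.functional as F

def aggregate_noun_attention(attn_maps, noun_token_idx, out_size=(64, 64)):
    """Average the cross-attention maps collected over layers and timesteps
    for a single noun token, then normalize to [0, 1].

    attn_maps: list of tensors, one per (layer, timestep), each of shape
               (H*W, L): softmaxed attention between visual positions and
               the L text tokens (shape and collection hook are assumptions).
    noun_token_idx: index of the noun (e.g., 'dog') in the tokenized prompt.
    """
    acc = torch.zeros(out_size)
    for a in attn_maps:
        hw, _ = a.shape
        side = int(hw ** 0.5)                          # assume square feature maps
        m = a[:, noun_token_idx].reshape(1, 1, side, side)
        m = F.interpolate(m, size=out_size, mode="bilinear", align_corners=False)
        acc += m[0, 0]
    acc /= len(attn_maps)
    # normalize so the averaged map can later be thresholded into a binary mask
    return (acc - acc.min()) / (acc.max() - acc.min() + 1e-8)

# toy usage: two fake 16x16 attention maps over a 10-token prompt
maps = [torch.rand(256, 10), torch.rand(256, 10)]
mask = aggregate_noun_attention(maps, noun_token_idx=3)
```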

3.2 Prompt Generation

We employ the ImageNet [10] label space as a prompt cue to create synthetic images. It is vital for the model to encounter a wide variety of images in order to learn universal representations applicable to various downstream tasks. A straightforward strategy to diversify synthetic images is to leverage a large language model to transform labels into sentences, effectively augmenting the prompts. However, simply utilizing a large model like T5 [53], as suggested by He et al. [18], could result in unrealistic prompts and thereby unrealistic images. This may inadvertently widen the gap between the synthetic and real domains, leading to inferior performance on downstream tasks. To generate more realistic and diverse images, we utilize GPT-3.5-turbo [54] to augment prompts. The augmentation templates vary based on the hierarchical level of classes in ImageNet, for example, “[Class] (with [other class]) is/are [somewhere]” or “[Class] with [other class] is/are [doing something] [somewhere]”.
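As an illustration of the template-filling step, the sketch below fills templates loosely modeled on the paper's examples with toy word lists; in the actual pipeline the filler phrases come from GPT-3.5-turbo [54], so the lists and the `augment_prompt` helper here are purely hypothetical.

```python
import random

# hypothetical templates, loosely following the paper's examples
TEMPLATES = [
    "{cls} is in {place}",
    "{cls} with {other} are in {place}",
    "{cls} with {other} is {action} in {place}",
]

def augment_prompt(cls, others, places, actions):
    """Fill one randomly chosen template with a class name and context words.
    In the actual pipeline the fillers come from GPT-3.5-turbo; here they are
    toy word lists for illustration only."""
    return random.choice(TEMPLATES).format(
        cls=cls,
        other=random.choice(others),
        place=random.choice(places),
        action=random.choice(actions),
    )

# toy usage with an ImageNet-style class name
print(augment_prompt("a golden retriever",
                     others=["a girl", "a frisbee"],
                     places=["a park", "the beach"],
                     actions=["playing", "running"]))
```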

3.3 Utilize Attention Masks for Unsupervised Learning

Upon generating diverse images and their free attention masks from the augmented prompts fed to the diffusion model, we propose modifications to three common unsupervised learning frameworks: Contrastive Learning, Masked Modeling, and Vision-Language Pretraining. These adjustments fully exploit the attention masks to augment unsupervised learning, thereby enhancing performance on downstream tasks.

3.3.1 Contrastive Learning

We adopt two widely used contrastive learning methods - SimCLR [1] and MoCo-v2 [22], to incorporate the use of free attention masks. For ease of explanation, we use SimCLR as a representative example, as depicted in Fig. 3 (a). As previously discussed in the introduction, using image-level features can be problematic when an image contains multiple instances, such as a girl and a dog. The augmented positive pairs, after random cropping, may contain different instances. This could negatively impact network training, as the distinct instance features that should be differentiated are instead conflated. To mitigate this, we use instance features based on the attention mask in place of image features.

Specifically, given an image, we apply a series of augmentations and a random crop, resulting in two cropped images, denoted as $x$ and $x^{\prime}$. Concurrently, the instance attention masks are cropped in line with the image operations, yielding two sets of masks $\{a^{1},\cdots,a^{N}\}$ and $\{a^{\prime 1},\cdots,a^{\prime N}\}$, each containing $N$ instances. The two image crops are then fed into the encoder (e.g., ResNet-50 [55]), producing outputs $z=f(x)$ and $z^{\prime}=f(x^{\prime})$. At this stage, we use the features $z\in\mathbb{R}^{h\times w\times c}$ and $z^{\prime}\in\mathbb{R}^{h\times w\times c}$, which are taken before the last average pooling layer. Next, the $m$-th instance attention masks $a^{m}$ and $a^{\prime m}$, resized to match the spatial resolution of the encoded features, are mapped onto the features $z$ and $z^{\prime}$, thereby applying attentive pooling. This process results in:

z^{m}=\frac{1}{\sum_{i,j}a_{i,j}}\sum_{i,j}^{h,w}a_{i,j}z_{i,j}, \quad (2)

and analogously $z^{\prime m}\in\mathbb{R}^{C}$. Following attentive pooling, the features $z^{m}$ and $z^{\prime m}$ are transitioned from the image level to the instance level. Subsequently, a straightforward multilayer perceptron (MLP) layer is applied to these features. For instance-level features, we redefine the contrastive loss for $z^{m}$ as

l=-\log\frac{\exp(sim(z^{m}, z^{\prime m}))}{\exp(sim(z^{m}, z^{\prime m}))+\sum_{n=1, n\neq m}^{N}\exp(sim(z^{m}, z^{n}))}, \quad (3)

where $sim(z^{m}, z^{\prime m})=\frac{{z^{m}}^{\top} z^{\prime m}}{\left|z^{m}\right|\cdot\left|z^{\prime m}\right|}$. In this loss computation, we consider the features of the same instance from the two crops as the positive pair, and the features of different instances as negative pairs. In Equation (3), $N$ signifies the total number of instances in the image; when extended to the batch dimension, $N$ can represent the total number of instances in the batch.
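A minimal PyTorch sketch of Eqs. (2) and (3) is given below, assuming the instance masks have already been resized to the feature grid; the temperature $\tau$ is an added convenience (standard in SimCLR) that does not appear in Eq. (3), and batch handling is omitted.

```python
import torch
import torch.nn.functional as F

def attentive_pool(feat, masks):
    """Eq. (2): mask-weighted average pooling.
    feat:  (h, w, c) encoder feature map taken before global average pooling.
    masks: (N, h, w) per-instance attention masks resized to the feature grid.
    Returns (N, c) instance-level features."""
    w = masks / (masks.sum(dim=(1, 2), keepdim=True) + 1e-8)
    return torch.einsum("nhw,hwc->nc", w, feat)

def instance_contrastive_loss(z, z_prime, tau=0.1):
    """Eq. (3): the same instance across the two crops forms the positive pair,
    different instances act as negatives. z, z_prime: (N, c) after the MLP head.
    tau is an added temperature (standard in SimCLR), not shown in Eq. (3)."""
    z = F.normalize(z, dim=-1)
    z_prime = F.normalize(z_prime, dim=-1)
    pos = (z * z_prime).sum(dim=-1, keepdim=True) / tau   # (N, 1) positives
    neg = (z @ z.t()) / tau                                # (N, N) m vs n similarities
    neg.fill_diagonal_(float("-inf"))                      # drop the n == m term
    logits = torch.cat([pos, neg], dim=1)                  # (N, 1 + N)
    labels = torch.zeros(z.size(0), dtype=torch.long)      # positive sits at index 0
    return F.cross_entropy(logits, labels)

# toy usage: 3 instances on a 7x7x2048 ResNet-50 feature map
z = attentive_pool(torch.randn(7, 7, 2048), torch.rand(3, 7, 7))
z_prime = attentive_pool(torch.randn(7, 7, 2048), torch.rand(3, 7, 7))
loss = instance_contrastive_loss(z, z_prime)
```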

For MoCo-v2, the encoders for the two image crops are distinct. One encoder is updated by the Exponential Moving Average (EMA) [4] of the other encoder. Furthermore, instance-level features, as opposed to image-level, are updated and stored in the memory bank. In addition to addressing the issue where positive pairs may comprise different instances, this strategy enables every image to provide a wealth of information. This greatly aids the network and allows for the learning of a more diverse representation, given the fact that each image typically encompasses multiple instances.

Figure 3: Adaptations of the three frameworks to utilize free attention masks.

3.3.2 Masked Modeling

As indicated in [24], certain image patches can be challenging to reconstruct, and these often correspond to the foreground objects in images. Prioritizing the reconstruction of these difficult patches helps the network learn a more discriminative representation. To accomplish this, Wang et al. [24] introduce an additional teacher-student network to predict these difficult patches. However, deploying an extra teacher-student model brings extra computation costs and complicates the learning process. In contrast, our free attention mask naturally provides an importance score for each patch: attention values from low to high correspond to patches from easy to difficult. This allows us to discard the teacher-student model for identifying challenging patches.

An intuitive approach would be to mask the patches with the highest attention map scores and then use pretraining from scratch to reconstruct the masked images. However, solely focusing on reconstructing the difficult patches may cause the network to overly concentrate on the foreground object, which could be detrimental to learning a more universal representation of the entire image. To mitigate this, during the initial stages of training, we continue to mask the patches randomly. As the training epochs increase (Fig. 3 (b)), we gradually raise the ratio of masked patches determined by the highest importance scores, and reduce the ratio of randomly selected masked patches. This balanced approach enables the network to learn both universal and targeted representations simultaneously.
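The sketch below illustrates this balanced masking schedule, assuming flattened per-patch attention scores; the 75% mask ratio and the linear ramp of the attention-guided fraction follow Sec. 4.2, while the function name and ceiling handling are illustrative.

```python
import torch

def balanced_mask(attn_scores, epoch, total_epochs,
                  mask_ratio=0.75, beta_max=0.8):
    """Choose which patches to mask for one image.

    attn_scores: (P,) per-patch attention scores (higher = foreground / harder).
    Early in training most masked patches are chosen at random; the fraction
    beta of attention-guided masked patches grows linearly with epochs up to
    beta_max. Returns a boolean tensor of shape (P,) with True = masked."""
    P = attn_scores.numel()
    n_mask = int(mask_ratio * P)
    beta = beta_max * min(epoch / max(total_epochs, 1), 1.0)  # linear ramp
    n_attn = int(beta * n_mask)                               # hardest patches
    n_rand = n_mask - n_attn                                  # random remainder

    masked = torch.zeros(P, dtype=torch.bool)
    masked[torch.topk(attn_scores, n_attn).indices] = True
    remaining = torch.nonzero(~masked).flatten()
    masked[remaining[torch.randperm(remaining.numel())[:n_rand]]] = True
    return masked

# toy usage: 14x14 = 196 ViT patches, halfway through 1600-epoch pretraining
mask = balanced_mask(torch.rand(196), epoch=800, total_epochs=1600)
```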

3.3.3 Vision-Language Pretraining.

The capacity for position grounding [56, 57] is vital for a vision-language model to excel in cross-modality downstream tasks. Some studies strive to enhance this capability by incorporating bounding boxes and region features as additional visual inputs during vision-language pretraining. However, obtaining these features and bounding boxes necessitates a robust, pre-trained offline detection model such as Fast R-CNN [58]. This process can be time-consuming and often significantly increases the parameter count of the vision-language model. In contrast, our synthetic data inherently includes the bounding box of each object (noun): we can readily transform the attention mask into a binary mask, with pixels marked as ‘1’ representing the foreground region, from which the bounding box is obtained.

Rather than extracting regions using bounding boxes as inputs for the visual encoder, we employ position-aware prompts, inspired by [59], which impose no additional parameters or computational demands on the vision-language model. As illustrated in Fig. 3 (c), we first divide the image into $N$ blocks. After determining the bounding box of an object, we identify the block in which the object’s center is located. Using this information, we generate position-aware prompts following the template: “The [O] is in block [P].”
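As a minimal sketch of this step, the snippet below binarizes an attention mask, derives the bounding box of the foreground, and maps its center to a block index on a 3x3 grid; the threshold, grid size, and helper names are illustrative assumptions.

```python
import numpy as np

def mask_to_block(mask, grid=3, thresh=0.5):
    """Binarize an attention mask, take the bounding box of the foreground
    pixels, and return the (row-major, 1-indexed) grid block containing the
    box center. Returns None if the mask has no foreground."""
    ys, xs = np.nonzero(mask >= thresh)
    if len(ys) == 0:
        return None
    cy = (ys.min() + ys.max()) / 2.0
    cx = (xs.min() + xs.max()) / 2.0
    h, w = mask.shape
    row = min(int(cy / h * grid), grid - 1)
    col = min(int(cx / w * grid), grid - 1)
    return row * grid + col + 1

def position_prompt(noun, mask, grid=3):
    block = mask_to_block(mask, grid)
    return f"The {noun} is in block {block}." if block is not None else ""

# toy usage: foreground in the lower-right of a 512x512 attention mask
mask = np.zeros((512, 512))
mask[320:480, 352:480] = 1.0
print(position_prompt("dog", mask))   # -> "The dog is in block 9."
```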

Subsequently, we concatenate these prompts for all objects in the image with the original prompt. This forms the input for the text encoder, enhancing the position-grounding ability of the vision-language model. The model in focus, BLIP [8], is adapted to our needs for Vision-Language Pretraining (VLP). We then conduct end-to-end training using conventional objectives. Consistent with the methodologies outlined in [8, 9, 60], the training process involves the use of Language Modeling (LM) loss, Image-Text Matching (ITM) loss, and Image-Text Contrastive (ITC) loss. Note that the object’s positional information is only required during the pre-training stage. For downstream tasks, we evaluate the model using standard end-to-end methods, without the need for object information.

4 Experiments

For contrastive learning and masked modeling pretraining, we generate approximately 1.2 million images using augmented prompts, matching the size of the original ImageNet-1K [10] dataset for fair comparison. For the supervised pretraining baseline, we utilize real data tagged with class labels. For pretraining on the vision-language task, we select a subset of CC3M [61] containing 0.3 million image-text pairs and employ the original captions to generate images.

4.1 Contrastive Learning

4.1.1 SimCLR

Pretraining. We employ ResNet-50 [62] as the encoder for pretraining. To maintain fairness in our comparisons, we adhere to the same 200 pretraining epochs across all settings, and all training hyperparameters are aligned with those outlined in the original SimCLR paper [1]. As outlined in Tab. 1, different terms denote different pretraining methods: ‘random init’ implies no pretraining, ‘supervised’ refers to pretraining on ImageNet with class labels, and ‘real’ indicates self-supervised pretraining on ImageNet. ‘Synthetic’ stands for self-supervised pretraining on purely synthetic images, while ‘synthetic w/ ours’ signifies our adapted method using synthetic images with free attention masks. ‘Mix’ denotes a blend of our synthetic images (with masks) and real images.

Object Detection and Segmentation. The pretrained network is utilized to initialize our feature extractor. For object detection and instance segmentation on COCO [26], we follow [4] and equip Mask R-CNN [63] with feature pyramid networks. The complete model is fine-tuned on the training set following the standard 1x schedule (12 epochs), and we report bounding box AP ($AP^{b}$) and instance mask AP ($AP^{m}$). For semantic segmentation on Cityscapes [27], we fine-tune the model for 160 epochs and report $mIoU$ (mean Intersection over Union).

Results. As shown in Tab. 1, self-supervised pretraining using purely synthetic data leads to a noticeable performance gap on the COCO and Cityscapes datasets compared to using purely real data. However, with the introduction of the free attention masks, we observe a marked improvement: gains of 1.3% and 1.1% on COCO, and 0.8% on Cityscapes. This improvement narrows the gap with real data to an almost negligible level. These findings underscore the efficacy of our instance-level contrastive learning facilitated by the free attention masks. Moreover, when combining both real and synthetic data, performance not only improves but also surpasses the results achieved using purely real data, without incurring any costs associated with human annotation or data collection.

4.1.2 MoCo-v2

Pretraining. We employ ResNet-50 as the encoder and train it for a total of 200 epochs. During this training phase, we adopt a learning rate adjustment strategy wherein the learning rate is multiplied by 0.1 at two specific points: the 120th and the 160th epochs. This adjustment is applied consistently across all experimental settings. Aside from these specifics, all other training hyperparameters strictly adhere to the guidelines and recommendations provided in the original MoCo-v2 paper [22]. This approach ensures that our training process is rigorous and consistent with established best practices.

Table 1: SimCLR [1] downstream results.
                     COCO                       Cityscapes
                     $AP^{b}$     $AP^{m}$      $mIoU$
random init          31.4         28.5          65.2
supervised           39.2         35.5          74.2
real                 37.7         34.2          74.8
synthetic            36.4         33.1          73.7
synthetic w/ ours    37.7 (+1.3)  34.2 (+1.1)   74.5 (+0.8)
mix w/ ours          38.7         34.8          75.0
Table 2: BLIP [8] image-text retrieval results.
                     MS-COCO
                     $TR@1$       $IR@1$
real                 58.1         44.2
VQGAN [64]           35.6         32.0
DALL-E2 [65]         44.5         38.6
Stable [34]          52.3         40.9
w/ ours              54.9 (+2.6)  43.8 (+2.9)
mix w/ ours          60.8         46.2
Table 3: MoCo-v2 [22] downstream results.
                     PASCAL VOC      COCO          Cityscapes
                     $AP^{b}_{50}$   $AP^{b}$      $mIoU$
random init          59.0            31.4          65.2
supervised           81.6            39.2          74.2
real                 82.4            39.8          75.0
synthetic            81.6            37.9          74.1
synthetic w/ ours    82.1 (+0.5)     38.3 (+0.4)   74.8 (+0.7)
mix w/ ours          82.6            40.1          75.3
Table 4: MAE [5] downstream results.
                     ImageNet       ADE20K
                     $acc$          $mIoU$
supervised           81.0           47.4
real                 83.6           48.1
synthetic            82.7           47.6
synthetic w/ ours    83.2 (+0.5)    48.0 (+0.4)
mix w/ ours          83.8           48.5

Object Detection and Semantic Segmentation. For object detection evaluation, we follow the standard protocol of fine-tuning a Faster R-CNN detector (with a C4 backbone) on the VOC trainval07+12 set under the 2x schedule described in [49], and evaluate on the VOC test2007 set. For COCO detection, our fine-tuning strategy remains consistent with that used for SimCLR above. For semantic segmentation on Cityscapes, we train an FCN [66] model on the train-fine set for 40k iterations and test on the validation set.

Results. Tab. 3 reveals a pattern similar to that observed with SimCLR: when pretraining is conducted solely on synthetic images, there is a noticeable performance gap compared to pretraining on real images. Yet, interestingly, the results obtained using only synthetic images can rival and even surpass those achieved through supervised learning on real images, a process that typically requires substantial human annotation and collection efforts. Furthermore, the application of free attention masks enhances the performance on synthetic images to a level comparable with that on real data, effectively bridging the gap between synthetic and real images. When pretraining incorporates both synthetic and real images, the results show an improvement over both supervised learning with real data and self-supervised pretraining, all without the need for human labor.

4.2 Masked Modeling

Pretraining. We employ MAE [5], a representative masked modeling method, and follow its original pretraining protocol for ViT-B [67] over a total of 1600 epochs. The overall mask ratio is set at 75%. Additionally, we incrementally increase the ratio $\beta$ of masked patches selected according to the highest attention scores, following a linear progression as epochs increase, with a ceiling of 0.8. Thus, a portion of the masked patches, specifically $\beta\times 0.75$ of them, is determined by the attention scores, while the remaining masked patches, accounting for $(1-\beta)\times 0.75$ of the total, are selected at random.

Image Classification and Semantic Segmentation. For classification tasks, we utilize the Masked Auto Encoder (MAE) [5]. We carry out end-to-end fine-tuning for 100 epochs on the ImageNet-1K training set and subsequently report the top-1 accuracy on the validation set. For segmentation tasks, we use UperNet [68] as the segmentation head. This is then subjected to end-to-end fine-tuning on the ADE20K [28] dataset for a total of 160k iterations.

Results. As shown in Tab. 4, MAE [5] exhibits a trend consistent with that of MoCo-v2. Utilizing free attention masks helps bridge the discrepancy between synthetic and real data. Additionally, by exclusively employing synthetic data, which requires no human annotation or collection costs, we can significantly outperform fully annotated supervised training (83.8 vs. 81.0).

4.3 Vision Language Pretraining

Pretraining. We train the vision-language model following BLIP [8] on 0.3M image-text pairs until the loss converges.

COCO-Retrieval. We evaluate BLIP on the COCO datasets, focusing on image-to-text retrieval (TR) and text-to-image retrieval (IR). The pre-trained model is fine-tuned using both ITC and ITM loss functions. Additionally, we implement a re-ranking strategy to further refine the retrieval results.

Results. Tab. 2 illustrates that the free attention masks, which provide position-aware prompts, enable the model to attain a substantial improvement over results obtained using purely synthetic data. Furthermore, combining synthetic data (with the masks) and real data considerably enhances the results, all without incurring any costs associated with human annotation or data collection.

4.4 Ablation Study

Image quality. Stable Diffusion is the most effective open-source model for text-to-image generation, surpassing both VQGAN [64] and DALL-E2-LAION [65] in terms of visual quality. We compare the impact of pretraining on images synthesized by these three generators. Tab. 2 reveals that higher image quality correlates with improved results. This suggests that as more powerful generators emerge in the future, the prospect of pretraining on synthetic data becomes increasingly promising.

PASCAL VOC         base    augment
$AP^{b}_{50}$      81.0    81.6
Table 5: Prompt ablations.

Prompt design. We assess the significance of augmented prompts by comparing against a basic prompt, “a photo of [class]”. As indicated in Tab. 5, augmented prompts contribute significantly to generating a broader range of images, enhancing the pretraining process.

More data, more power. We examine the effectiveness of our framework powered by an increasing quantity of diffusion-generated images. Here we adopt MoCo-v2 [22] to incorporate our Free-ATM, and report the pretraining transfer performance on the PASVOC [25] dataset. The results, displayed in Tab. 6, demonstrate a consistent performance enhancement in line with the increase in synthetic image count. This observation suggests that synthetic images can indeed scale in a manner similar to real images.

number of synthetic images    0.5M    1M      1.5M    2M
$AP^{b}_{50}$                 81.3    82.1    82.4    82.7
Table 6: Using more synthetic images for our unsupervised pretraining framework.

Moreover, by utilizing 1.5M synthetic images, our approach achieves equivalent performance to the original MoCo-v2, which is trained on 1M real-world images from ImageNet [10]. This finding highlights that by increasing the amount of synthetic data, our method can effectively close the performance gap between unsupervised pretraining on synthetic data and real-world scenarios. Supported by our Free-ATM, synthetic data can be used as a viable alternative to real-world data for unsupervised pretraining, which can be beneficial in scenarios where real-world data is scarce or insufficient. In the future, we will keep exploring the upper bounds of our framework using more synthetic data.

5 More Visualizations of Attention Masks

Fig. 4 provides diverse visualizations of the freely-available attention masks from text-to-image diffusion models, where the cross-attention maps between object instances and their corresponding text prompts are decently aligned.

Figure 4: Visualizations of Attention Masks.

6 Conclusion

We have proposed Free-ATM, a technique that utilizes the readily available attention masks from diffusion generators, to improve unsupervised learning on diffusion-generated images. Our method is demonstrated to be effective through extensive experimental results across various downstream tasks and benchmarks. We anticipate that our approach will provide new directions for future research on leveraging synthetic images to tackle computer vision problems.

References

  • [1] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in ICML, 2020.
  • [2] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, B. Piot, k. kavukcuoglu, R. Munos, and M. Valko, “Bootstrap your own latent - a new approach to self-supervised learning,” in NeurIPS, 2020.
  • [3] X. Chen and K. He, “Exploring simple siamese representation learning,” in CVPR, 2021.
  • [4] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in CVPR, 2020.
  • [5] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in CVPR, 2022.
  • [6] Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu, “Simmim: A simple framework for masked image modeling,” in CVPR, 2022.
  • [7] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
  • [8] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in ICML, 2022.
  • [9] W. Kim, B. Son, and I. Kim, “Vilt: Vision-and-language transformer without convolution or region supervision,” in ICML, 2021.
  • [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.
  • [11] T. Orekondy, B. Schiele, and M. Fritz, “Towards a visual privacy advisor: Understanding and predicting privacy risks in images,” in ICCV, 2017.
  • [12] V. Besnier, H. Jain, A. Bursuc, M. Cord, and P. Pérez, “This dataset does not exist: Training models from generated images,” in ICASSP, 2020.
  • [13] B. Zhao and H. Bilen, “Synthesizing informative training samples with gan,” NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research, 2022.
  • [14] A. Brock, J. Donahue, and K. Simonyan, “Large scale GAN training for high fidelity natural image synthesis,” in ICLR, 2019.
  • [15] Y. Zhang, H. Ling, J. Gao, K. Yin, J.-F. Lafleche, A. Barriuso, A. Torralba, and S. Fidler, “Datasetgan: Efficient labeled data factory with minimal human effort,” in CVPR, 2021.
  • [16] D. Li, H. Ling, S. W. Kim, K. Kreis, A. Barriuso, S. Fidler, and A. Torralba, “Bigdatasetgan: Synthesizing imagenet with pixel-wise annotations,” in CVPR, 2022.
  • [17] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” TPAMI, 2021.
  • [18] R. He, S. Sun, X. Yu, C. Xue, W. Zhang, P. Torr, S. Bai, and X. Qi, “Is synthetic data from generative models ready for image recognition?” in ICLR, 2023.
  • [19] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” in ICML, 2022.
  • [20] A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or, “Prompt-to-prompt image editing with cross attention control,” arXiv preprint arXiv:2208.01626, 2022.
  • [21] W. Zhao, Y. Rao, Z. Liu, B. Liu, J. Zhou, and J. Lu, “Unleashing text-to-image diffusion models for visual perception,” arXiv preprint arXiv:2303.02153, 2023.
  • [22] X. Chen, H. Fan, R. Girshick, and K. He, “Improved baselines with momentum contrastive learning,” arXiv preprint arXiv:2003.04297, 2020.
  • [23] H. Bao, L. Dong, S. Piao, and F. Wei, “BEit: BERT pre-training of image transformers,” in ICLR, 2022.
  • [24] H. Wang, K. Song, J. Fan, Y. Wang, J. Xie, and Z. Zhang, “Hard patches mining for masked image modeling,” in CVPR, 2023.
  • [25] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” IJCV, 2010.
  • [26] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014.
  • [27] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in CVPR, 2016.
  • [28] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ade20k dataset,” IJCV, 2019.
  • [29] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in NeurIPS, 2020.
  • [30] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” in ICLR, 2021.
  • [31] J. Ho and T. Salimans, “Classifier-free diffusion guidance,” arXiv preprint arXiv:2207.12598, 2022.
  • [32] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” in NeurIPS, 2021.
  • [33] Y. Shi, C. Xue, J. Pan, W. Zhang, V. Y. Tan, and S. Bai, “Dragdiffusion: Harnessing diffusion models for interactive point-based image editing,” arXiv preprint arXiv:2306.14435, 2023.
  • [34] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in CVPR, 2022.
  • [35] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans et al., “Photorealistic text-to-image diffusion models with deep language understanding,” in NeurIPS, 2022.
  • [36] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015.
  • [37] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [38] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, 2020.
  • [39] A. Jahanian, X. Puig, Y. Tian, and P. Isola, “Generative models as a data source for multiview representation learning,” in ICLR, 2022.
  • [40] S. Azizi, S. Kornblith, C. Saharia, M. Norouzi, and D. J. Fleet, “Synthetic data from diffusion models improves imagenet classification,” arXiv preprint arXiv:2304.08466, 2023.
  • [41] M. B. Sariyildiz, K. Alahari, D. Larlus, and Y. Kalantidis, “Fake it till you make it: Learning transferable representations from synthetic imagenet clones,” in CVPR, 2023.
  • [42] P. Bachman, R. D. Hjelm, and W. Buchwalter, “Learning representations by maximizing mutual information across views,” in NeurIPS, 2019.
  • [43] O. J. Hénaff, A. Srinivas, J. De Fauw, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord, “Data-efficient image recognition with contrastive predictive coding,” arXiv preprint arXiv:1905.09272, 2019.
  • [44] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learning via non-parametric instance discrimination,” in CVPR, 2018.
  • [45] I. Misra and L. v. d. Maaten, “Self-supervised learning of pretext-invariant representations,” in CVPR, 2020.
  • [46] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
  • [47] M. Ye, X. Zhang, P. C. Yuen, and S.-F. Chang, “Unsupervised embedding learning via invariant and spreading instance feature,” in CVPR, 2019.
  • [48] Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview coding,” in ECCV, 2019.
  • [49] X. Wang, R. Zhang, C. Shen, T. Kong, and L. Li, “Dense contrastive learning for self-supervised visual pre-training,” in CVPR, 2021.
  • [50] J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong, “ibot: Image bert pre-training with online tokenizer,” in ICLR, 2022.
  • [51] A. Baevski, W.-N. Hsu, Q. Xu, A. Babu, J. Gu, and M. Auli, “Data2vec: A general framework for self-supervised learning in speech, vision and language,” arXiv preprint arXiv:2202.03555, 2022.
  • [52] C. Wei, H. Fan, S. Xie, C.-Y. Wu, A. Yuille, and C. Feichtenhofer, “Masked feature prediction for self-supervised visual pre-training,” in CVPR, 2022.
  • [53] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, 2020.
  • [54] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” in NeurIPS, 2020.
  • [55] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  • [56] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang et al., “Grounded language-image pre-training,” in CVPR, 2022.
  • [57] Z. Liu, S. Stent, J. Li, J. Gideon, and S. Han, “Loctex: Learning data-efficient visual representations from localized textual supervision,” in ICCV, 2021.
  • [58] R. Girshick, “Fast r-cnn,” in ICCV, 2015.
  • [59] A. J. Wang, P. Zhou, M. Z. Shou, and S. Yan, “Position-guided text prompt for vision-language pre-training,” in CVPR, 2023.
  • [60] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
  • [61] P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,” in ACL, 2018.
  • [62] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  • [63] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask r-cnn,” in ICCV, 2017.
  • [64] P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” in CVPR, 2021.
  • [65] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in ICML, 2021.
  • [66] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
  • [67] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in ICLR, 2021.
  • [68] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, “Unified perceptual parsing for scene understanding,” in ECCV, 2018.