
Content-Aware Preserving Image Generation

Giang H. Le, Anh Q. Nguyen, Byeongkeun Kang, Yeejin Lee
Abstract

Remarkable progress has been achieved in image generation with the introduction of generative models. However, precisely controlling the content in generated images remains a challenging task due to their fundamental training objective. This paper addresses this challenge by proposing a novel image generation framework explicitly designed to incorporate desired content in output images. The framework utilizes advanced encoding techniques, integrating subnetworks called content fusion and frequency encoding modules. The frequency encoding module first captures features and structures of reference images by exclusively focusing on selected frequency components. Subsequently, the content fusion module generates a content-guiding vector that encapsulates desired content features. During the image generation process, content-guiding vectors from real images are fused with projected noise vectors. This ensures the production of generated images that not only maintain consistent content from guiding images but also exhibit diverse stylistic variations. To validate the effectiveness of the proposed framework in preserving content attributes, extensive experiments are conducted on widely used benchmark datasets, including Flickr-Faces-High Quality, Animal Faces High Quality, and Large-scale Scene Understanding datasets.

keywords:
Generative models, image generation, unsupervised learning, feature representation, self-supervised learning
Affiliation 1: Department of Electrical and Information Engineering, Seoul National University of Science and Technology, 232 Gongneung-ro, Nowon-gu, Seoul 01811, Republic of Korea

Affiliation 2: Department of Electronic Engineering, Seoul National University of Science and Technology, 232 Gongneung-ro, Nowon-gu, Seoul 01811, Republic of Korea

1 Introduction

Remarkable progress has been made in computer vision tasks with the emergence of deep artificial neural networks, specifically convolutional neural networks (CNNs) (Tan and Le, 2019; He et al., 2016; Hu et al., 2018; Szegedy et al., 2016; Simonyan and Zisserman, 2015; Brock et al., 2021) and vision transformers (ViTs) (Dosovitskiy et al., 2021; Bai et al., 2022; Zhang et al., 2022). These models demonstrate strong performance across a spectrum of supervised vision tasks in various industrial domains, for example, defect detection and quality control in manufacturing, anomaly detection in security applications, object recognition for robot vision (Tan and Le, 2021; Redmon and Farhadi, 2018; He et al., 2017; Dosovitskiy et al., 2021; Lee et al., 2022; Zhang et al., 2021; Wang et al., 2023a), and scene understanding for autonomous vehicles (Liu et al., 2021; Lu et al., 2023; Han et al., 2024b). However, the process of data cleaning and annotation required for supervised tasks is highly costly (Deng et al., 2009; Lin et al., 2014; Krizhevsky et al., 2009), making it challenging to apply these networks without readily available annotations.

To address this, generative models have emerged as promising solutions. For instance, generative adversarial networks (GANs) allow for high-quality image generation that closely aligns with the desired dataset distribution, offering a way to obtain the data needed for vision tasks (Karnewar and Wang, 2020; Karras et al., 2020a, 2019, b; Odena et al., 2017; Choi et al., 2018, 2020; Karras et al., 2021; Zhou et al., 2021; Mahendren et al., 2023). Similarly, recent diffusion models (Ho et al., 2020; Song et al., 2020; Zhuang et al., 2023; Wang et al., 2023b) have gained attention for their ability to produce high-quality images. With advancements in large language models, diffusion models have increasingly focused on integrating visual models with other modalities, such as text encoders, to enable text-conditional generation. This integration makes them particularly effective for tasks like text-to-image synthesis (Rombach et al., 2022; Hertz et al., 2022; Han et al., 2024a). While diffusion models are currently preferred for text-based applications, GANs still have significant potential, especially in scenarios focused solely on image-based applications. Once trained, GANs offer the advantage of faster image generation and have been very successful in producing sharp, high-quality images where their specific strengths can be fully exploited.

The GAN is a framework that estimates generative models through an adversarial process (Goodfellow et al., 2014). GANs have played a significant role in various computer vision applications and have been an active research area since their invention. Despite the considerable success of GANs in the literature, effectively managing and controlling specific content of generated images, such as the underlying spatial structure or precise attributes, remains a challenging task. This challenge stems from the fundamental training objective of GANs, which primarily focuses on mapping the distribution of output images to an input distribution rather than explicitly generating images with desired content (Goodfellow et al., 2014; Karras et al., 2019, 2020b; Karnewar and Wang, 2020), as depicted in Figure 1(a). Consequently, generating images that meet users’ requirements using GANs is still a difficult problem.

Refer to caption Refer to caption
(a) (b)
Figure 1: Comparison of (a) typical image generation with (b) the proposed content-preserving image generation. Typical GANs generate random images following the distribution of real images. In contrast, the proposed framework allows users to exert control over image generation, enabling them to specify desired content attributes in the generated images.

To overcome this challenge, several methods have been proposed to impose specific constraints on the generated images, aiming to gain control over the image generation process. One approach is to convert GANs into a supervised learning framework by integrating auxiliary information, such as class labels or reference images, to guide the content of the generated outputs (Mirza and Osindero, 2014). This is done by concatenating the additional inputs with random noise and feeding them into the networks. Researchers have also proposed discriminators that, in addition to the real/fake critic, classify samples into categorical labels, typically by adding a cross-entropy term to the objective function (Odena et al., 2017). This approach has shown the possibility of controlling the content of generated images (Choi et al., 2018, 2020). However, manually annotating training images is highly labor-intensive and time-consuming. Moreover, widely used benchmark datasets for image generation, such as Flickr-Faces High Quality (FFHQ) (Karras et al., 2019), Animal Faces High Quality (AFHQ) (Choi et al., 2018, 2020), and Large-scale Scene Understanding (LSUN) (Yu et al., 2015), are typically unlabeled. Hence, converting unsupervised tasks into supervised learning models may not be a practical solution to the challenge of efficiently controlling the content of images generated by GANs.

Another recent approach involves understanding the latent space of GANs and using it to control image generation (Shen et al., 2020; Karras et al., 2019). These methods typically utilize a multilayer perceptron encoder to transform the Gaussian noise input into a disentangled distribution of the same dimension. By using low-distance samples from this disentangled distribution, it becomes easier to generate images with the desired style. However, these methods do not have the ability to directly generate feature vectors from real-world images that would influence the output images. Instead, they only provide control over output images through a predefined set of synthesized samples. More recently, an unsupervised method was proposed in (Balakrishnan et al., 2022) to control the attributes of generated images by finding the optimal direction in the latent space based on the differential changes in different feature sets. However, the acquisition of the feature sets requires additional models, such as a face recognition model for identity features or an age regression model for age features.

To tackle the aforementioned labeling challenge and the constraints on control in limited settings, we propose a novel framework that can generate high-fidelity images with specific content characteristics requested by users. The proposed framework leverages a real-world image to guide the generation process, resulting in output images that closely resemble the content of the input image, as illustrated in Figure 1(b). To achieve this, we introduce an encoding module incorporating two essential components: an encoder module for extracting content features and a content fusion module for generating refined content-guiding vectors. This module analyzes generated images with known feature embeddings to determine the direction and intensity of content perturbations. Using the resulting content feature vector, the proposed framework can produce images that preserve the content of the guiding images without being confined to predefined sets or constraints. Specifically, the encoding module is designed to concentrate solely on selected frequency components linked to the content features of interest, learning desired features while excluding undesired ones from the learning process. The effectiveness of frequency selection in the generation process is verified by analyzing the impact of different frequency bands on the overall process. This analysis provides convincing evidence that frequency analysis can serve as a meaningful guide in determining which components should be processed to enhance content preservation and image quality. This finding also aligns with recent evidence that manipulating the frequency components of input images can significantly improve model performance (Ehrlich and Davis, 2019; Yin et al., 2019; Schwarz et al., 2021; Wang et al., 2020).

In summary, this work makes the following contributions:

  • 1. We propose a novel framework, named the Content-Aware Preserving Image Generation Framework (CAP-GAN), which enables image generation with explicit control over the content of the output images, ensuring content fidelity with guiding images.

  • 2. We investigate the impact of frequency bands on image generation and demonstrate that frequency analysis can serve as a valuable guide for content preservation and image quality improvement.

  • 3. Extensive experiments have been conducted to validate the effectiveness of the proposed framework in preserving the content of guiding images in the generated outputs.

The remainder of this paper is organized as follows: Section 2 provides a comprehensive review of various types of GANs and diffusion models, and it explores the role of frequency analysis in vision processing. In Section 3, we present a detailed description of the proposed framework, including the structure of each block and the underlying design principles. We also explain the process of extracting frequency components and their seamless integration into the proposed framework. Section 4 presents the experimental results and additional studies, offering a thorough analysis and interpretation of the findings. Finally, in Section 5, we conclude the paper with final remarks.

2 Related Work

2.1 Content Control in Generative Models

Generative models have proven successful in image generation tasks. However, predicting the precise content of the outputs remains challenging due to the inherently stochastic nature of the generation process. To address this issue, some research has introduced methods that enable control over the details of the generated outputs.

One approach uses annotated information, such as labels, to improve control over the generated outputs. For example, the auxiliary classifier GAN (AC-GAN) incorporates the embedded label into the input latent noise (Odena et al., 2017). In AC-GAN, the discriminator plays a dual role: it distinguishes between real and fake samples and predicts the expected label of the input image. Other methods (Choi et al., 2018, 2020) have also attempted to control the attributes of the generated images using multi-class labels. Additionally, early research on diffusion models (Dhariwal and Nichol, 2021) has shown that, with classifier guidance, these models can achieve sample quality comparable to that of GANs. To eliminate the need for classifiers, the work (Ho and Salimans, 2022) proposes a method that jointly trains a conditional and an unconditional diffusion model and combines their results. However, this approach still requires annotated labels during training.

Another approach is to extract latent codes to control the image generation process. For example, MixNMatch (Li et al., 2020) generates latent vectors using four different encoders and combines these vectors to control different aspects of the generated content, with each encoder specializing in features such as background, object pose, shape, and texture. Similarly, the work (Balakrishnan et al., 2022) manages image attributes by finding the optimal direction in latent space based on differential changes between feature sets, using an auxiliary face identification model and a landmark extraction model. HistoGAN (Afifi et al., 2021) controls color information in generated images by analyzing differentiable histograms of colors in log-chroma space. The diffusion-GAN model (Wang et al., 2023b) integrates a diffusion model into the GAN framework, using the diffusion model as a generative component while applying adversarial training techniques to improve sample quality. The diffusion probabilistic field (DPF) (Zhuang et al., 2023) also captures latent information by parameterizing it as additional variables.

Note that the methods mentioned above, whether GAN-based or diffusion-based, can generate high-quality images but often require annotated information, such as labels. Even those methods that do not rely on annotations typically lack explicit mechanisms for controlling the content of the generated images. In contrast, the proposed framework offers significant advantages. It allows for explicit control over the content of output images by using latent embeddings derived from real-world images, focusing solely on the desired content while suppressing undesired components. This approach eliminates the need for additional annotation information. Moreover, the proposed framework enables content-preserving image generation with only a single generator, encoder, and discriminator, avoiding the complexity of using multiple encoders, generators, and discriminators. This streamlined architecture not only simplifies the model but also reduces computational cost and resource requirements, making it a more efficient and practical solution.

2.2 Frequency Analysis in Neural Networks

Previous research has extensively explored the relationship between frequency components and image perception, emphasizing the significance of both high- and low-frequency information (Oliva et al., 2006; Petras et al., 2019; Lee and Hirakawa, 2022). Early work in (Oliva et al., 2006) conducted experiments with hybrid images, each combining two source images, to demonstrate that high-frequency components are crucial for object recognition, capturing fine details and edges, while low-frequency components contribute to the perception of global scene layout and semantic context. Additionally, Petras et al. observed that low spatial frequencies provide coarse information that guides the integration of finer, high-spatial-frequency details, contributing to a more comprehensive understanding of visual input (Petras et al., 2019). More recently, the work in (Lee and Hirakawa, 2022) proposed a lossless image compression method that considers both low- and high-frequency components. This study highlighted the importance of preserving global structural information and fine details simultaneously, leading to improved compression outcomes.

Building upon the aforementioned findings, researchers have investigated the use of frequency information in deep neural networks for various computer vision tasks (Gueguen et al., 2018; Ehrlich and Davis, 2019; Yin et al., 2019). Likewise, in GANs, frequency-based methods have been explored for generating high-quality images (Schwarz et al., 2021; Wang et al., 2020; Gal et al., 2021; Li et al., 2023; Chen et al., 2021; Durall et al., 2020; Dzanic et al., 2020; Fuoli et al., 2021; Khayatkhoei and Elgammal, 2022; Frank et al., 2020). SWAGAN (Gal et al., 2021) is an example of using wavelet transformations to train models in the frequency domain, producing better results for small details than traditional methods. Some works have explored incorporating the frequency spectrum into the discriminator module (Li et al., 2023; Chen et al., 2021) or adding it as a regularization term to the objective function (Durall et al., 2020). These works aim to enhance the generation of synthetic images by emphasizing the frequency details. Other studies have highlighted the disparity between real and synthetic images in the frequency domain, despite their visual similarity in the spatial domain. To address this issue, some works focus on utilizing augmentation techniques to bridge the gap in the frequency domain (Dzanic et al., 2020). Others utilize frequency information directly in the training process (Fuoli et al., 2021) or manipulate the upsampling methods within the generator module (Khayatkhoei and Elgammal, 2022; Frank et al., 2020).

The previously mentioned studies have shown that converting images from the spatial domain to the frequency domain can help models capture important features that might be difficult to extract in the image domain. However, this approach also carries a risk of losing features that are only present in the image domain. As such, the proposed work adopts a different strategy. We initially process the image in the frequency domain, applying techniques to eliminate redundant features and extract content information. Subsequently, we convert this processed information back to the image domain, allowing our model to leverage additional information in the spatial domain. This approach retains both the advantages of frequency-based methods in capturing refined, meaningful features and the benefits of working directly in the spatial domain to preserve unique structural features.

3 Proposed Methodology

Considering the challenges of controlling the content of generated images solely from a random noise vector, we propose a framework that incorporates real-world images to guide the encoding of desired content features. In Section 3.1, we present the overall structure of the proposed framework, outlining the main components and their interactions. In Section 3.2, we outline the training strategy for each module in the framework and explain the rationale behind our design choices.

3.1 Overview

Figure 2 provides an overview of the architectural design of the proposed framework during training, and Figure 3 illustrates the process for inference. The proposed framework consists of two main modules: a generating module and a frequency encoding module.

Generating Module. This module aims to synthesize images from random vectors and includes three components: the mapping network $E_{\psi}$, the discriminator $D_{\zeta}$, and the generator $G_{\phi}$, which are adopted from StyleGAN2.

The generator $G_{\phi}$, parameterized by $\phi$, takes the vectors $\mathbf{w}_{1}\in\mathbb{R}^{512}$ and $\mathbf{w}_{2}\in\mathbb{R}^{512}$ as inputs and produces synthesized images $\mathbf{x}\in\mathbb{R}^{256\times 256\times 3}$. The vectors $\mathbf{w}_{1}$ and $\mathbf{w}_{2}$ are projected by the mapping network $E_{\psi}$, parameterized by $\psi$, from $\mathbf{z}_{1}$ and $\mathbf{z}_{2}$, which are randomly sampled from a standard normal distribution $\mathcal{N}(0,1)$. Once $G_{\phi}$ produces images $\mathbf{x}$, the discriminator $D_{\zeta}$ evaluates them, along with real-world images $\mathbf{y}\in\mathbb{R}^{256\times 256\times 3}$, to determine whether they should be classified as real or fake.

Frequency Encoding Module. This module aims to encode the final content-guiding vector $\mathbf{q}^{4}\in\mathbb{R}^{512}$ from the frequency-refined image $\mathbf{x}_{f}$, as shown in Figure 2. The vector $\mathbf{q}^{4}$ is then used as an input to $G_{\phi}$, replacing $\mathbf{w}_{1}$, to control the content of generated images at inference.

This module consists of two components: the content fusion module (CFM) and the encoder module (EM). The role of the EM is to progressively extract features from an input image. This process begins by taking a feature map derived from a frequency-refined intensity image $\mathbf{x}_{f}\in\mathbb{R}^{128\times 128}$ and then producing the intermediate content-guiding vectors $\mathbf{s}\in\mathbb{R}^{1024}$ through the component manipulation block (CMB) and the projecting block (PB). The role of the CFM is to train learnable constant vectors and integrate them with $\mathbf{s}$ from the EM blocks. Specifically, the intermediate content-guiding vectors $\mathbf{s}$ are combined with the vectors $\mathbf{p}\in\mathbb{R}^{512}$, trained through the CFM, and then transformed into the final content-guiding vector $\mathbf{q}^{4}\in\mathbb{R}^{512}$.

Training Strategy. It is well established that the effects of injecting styles into style-based generators depend on the layers to which the styles are applied (Karras et al., 2019, 2020b; Brock et al., 2019; Karras et al., 2020a). Style vectors injected into coarse and middle (lower) layers (denoted as $\mathbf{w}_{1}$ in Figure 2) mainly affect the high-level features of an image, such as pose, general hairstyle, or face shape. In contrast, style vectors injected into finer (higher) layers (denoted as $\mathbf{w}_{2}$ in Figure 2) play a key role in defining background details or color information. Based on this understanding, the proposed method is designed to control image content by injecting the desired content styles, produced by the proposed frequency encoding module, into the coarse and middle layers. Specifically, as shown in Figure 3, during inference, the generator $G_{\phi}$ produces images using two vectors: a content-guiding vector $\mathbf{q}^{4}$ extracted from a real-world image and a noise-projected vector $\mathbf{w}_{2}$. The content-guiding vector $\mathbf{q}^{4}$, obtained from the frequency encoding module, is injected into the coarse (lower) layers of the generator. Meanwhile, the vector $\mathbf{w}_{2}$, projected from a noise vector, is injected into the fine (higher) layers to introduce variation into the output images.
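The layer-wise injection described above can be made concrete with a short sketch. The snippet below is a minimal, hypothetical inference routine assuming a StyleGAN2-style generator that accepts one style vector per synthesis layer; the names `generator`, `mapping`, `freq_encoder`, the attribute `num_style_layers`, and the coarse-layer count are illustrative assumptions rather than details taken from the paper.

```python
import torch

@torch.no_grad()
def generate_with_content_guidance(generator, mapping, freq_encoder, real_image,
                                    num_coarse_layers=7, device="cuda"):
    # Content-guiding vector q from a real image via the frequency encoding module.
    q = freq_encoder(real_image.to(device))          # assumed shape: (1, 512)

    # Noise-projected style vector w2 for the fine (higher) layers.
    z2 = torch.randn(1, 512, device=device)
    w2 = mapping(z2)                                  # assumed shape: (1, 512)

    # One style vector per synthesis layer: q drives the coarse/middle layers,
    # w2 drives the remaining fine layers (layer counts are assumptions).
    num_layers = getattr(generator, "num_style_layers", 14)
    styles = [q] * num_coarse_layers + [w2] * (num_layers - num_coarse_layers)

    # Assumes the generator accepts a per-layer list of style vectors.
    return generator(styles)
```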

Refer to caption

(a)

Refer to caption

(b) CMB

Refer to caption

(c) PB

Figure 2: Description of the main components of the proposed framework during training. (a) Overall structure. (b) Composition of the component manipulation block (CMB). (c) Composition of the projecting block (PB). During the training process, the frequency encoding module generates a content-guiding vector from a frequency-analyzed image. This vector is designed to closely align with the feature embedding of the reference image $\mathbf{x}$. This alignment enables effective control over the content of the generated images during the inference stage. Note that BatchNorm in the CMB refers to batch normalization (Ioffe and Szegedy, 2015), GAP in the PB stands for global average pooling, FC denotes a fully connected layer, and FT represents feature transformation.
Refer to caption
Figure 3: Inference phase of the proposed framework. The trained frequency encoding module takes a real image as an input, extracts content features from it, and transfers these features to the generator.

3.2 Training

To achieve the training goal, the proposed framework is trained in a two-phase process: 1) in the first phase, the generation module is optimized to generate images based on the projected noise vectors, and 2) in the second phase, the frequency encoding module is trained to generate content-guiding vectors, while keeping the parameters of the generation module frozen.

3.2.1 Phase I: Training Generating Module

This training phase has two primary objectives: to optimize $G_{\phi}$ and $E_{\psi}$ to generate high-fidelity images that closely match real-world image distributions, and to optimize $D_{\zeta}$ to accurately distinguish the generated images as fake. The generator $G_{\phi}$ is optimized using the loss determined by the evaluation of generated images:

\mathcal{L}_{G}=-\frac{1}{N_{b}}\sum\limits^{N_{b}}_{j=1}\log\left(\sigma\left(D_{\zeta}\left(G_{\phi}\left(\mathbf{w}_{1,j}\right)\right)\right)\right),   (1)

where $N_{b}$ is the number of images in a mini-batch, $j$ denotes the image index, and $\sigma(\cdot)$ denotes the sigmoid function. Subsequently, the discriminator of this module is trained using the adversarial loss with a regularizer of the squared norm of the gradients of the discriminator on real images (Mescheder et al., 2018):

\mathcal{L}_{adv}=\frac{1}{N_{b}}\sum\limits^{N_{b}}_{j=1}\Big[\log\left(\sigma\left(D_{\zeta}\left(\mathbf{y}_{j}\right)\right)\right)+\log\left(\sigma\left(1-D_{\zeta}\left(G_{\phi}\left(\mathbf{w}_{1,j}\right)\right)\right)\right)+\lambda\,\lVert\nabla D_{\zeta}\left(\mathbf{y}_{j}\right)\rVert_{2}^{2}\Big],   (4)

where $\lambda$ is a weighting factor to control the influence of each loss term on the total loss function, $\nabla$ denotes the gradient of a function, and $\lVert\cdot\rVert_{2}$ denotes the $\ell_{2}$-norm.
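As a reference point, the two Phase-I objectives in (1) and (4) can be sketched in PyTorch as follows. This is a minimal sketch using the standard non-saturating/softplus formulation and the R1 gradient penalty of Mescheder et al. (2018); the default `r1_weight` standing in for $\lambda$ is an assumption.

```python
import torch
import torch.nn.functional as F

def generator_loss(discriminator, fake_images):
    """Non-saturating generator loss as in Eq. (1): -E[log sigma(D(G(w1)))]."""
    fake_logits = discriminator(fake_images)
    return F.softplus(-fake_logits).mean()   # -log(sigmoid(x)) == softplus(-x)

def discriminator_loss(discriminator, real_images, fake_images, r1_weight=10.0):
    """Adversarial loss with the R1 gradient penalty on real images, cf. Eq. (4);
    the weighting value is an assumption."""
    real_images = real_images.detach().requires_grad_(True)
    real_logits = discriminator(real_images)
    fake_logits = discriminator(fake_images.detach())

    # Standard real/fake terms of the adversarial objective.
    loss = F.softplus(-real_logits).mean() + F.softplus(fake_logits).mean()

    # R1 regularizer: squared norm of the discriminator gradient on real images.
    grads, = torch.autograd.grad(outputs=real_logits.sum(), inputs=real_images,
                                 create_graph=True)
    r1_penalty = grads.flatten(1).pow(2).sum(dim=1).mean()
    return loss + r1_weight * r1_penalty
```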

3.2.2 Phase II: Training Frequency Encoding Module

The goal of the second training phase is to optimize the parameters of all components in the frequency encoding module. This is achieved by utilizing frequency-refined images that emphasize high-level content features while suppressing undesired fine details. The procedure for refining these frequency components is outlined in the “Frequency Selection” section. Using these refined inputs, the encoder modules analyze the patterns to produce intermediate content-guiding vectors (as detailed in the “Guiding Vector Generation” section). These intermediate vectors are then used to produce the final content-guiding vectors (see the “Content Fusion” section). This process ensures that the extracted features align closely with the embeddings of reference images. The loss function used to measure this alignment is discussed in the “Loss Design” section.

Guiding Vector Generation. In the frequency encoding module, four EMs are connected in sequence ($i=1,2,3,4$) to extract fine features from $\mathbf{x}_{f}$ and produce distilled guiding vectors ($\mathbf{s}^{1},\cdots,\mathbf{s}^{4}$). These distilled vectors are then used to generate the final content-guiding vector $\mathbf{q}^{4}$. In each EM, feature maps are processed by the CMB for further feature extraction. The resultant feature maps are then combined with the input feature maps using a $1\times 1$ convolution. This addition enhances the depth of the blocks by leveraging identity mappings and residual connections, which helps mitigate training issues such as exploding or vanishing gradients during backpropagation (He et al., 2016). Meanwhile, the feature maps produced by the CMB are passed through the PB, which transforms them into the distilled guiding vector $\mathbf{s}^{i}=\left[\mathbf{s}_{a}^{i},\>\mathbf{s}_{b}^{i}\right]\in\mathbb{R}^{1024}$ for the $i$-th EM. These distilled vectors are divided into two parts: the upper part $\mathbf{s}_{a}^{i}\in\mathbb{R}^{512}$ is responsible for modifying the direction of contents based on the extracted features, while the lower part $\mathbf{s}_{b}^{i}\in\mathbb{R}^{512}$ is responsible for translating the modified vector.
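A hedged sketch of one EM is given below; the exact layer composition of the CMB and PB (kernel sizes, strides, downsampling) is not fully specified in the text, so the choices here are assumptions that only illustrate the residual 1×1 combination and the split of the 1024-dimensional distilled vector.

```python
import torch
import torch.nn as nn

class EncoderModule(nn.Module):
    """Sketch of one EM; CMB/PB internals are assumptions, not the paper's exact design."""
    def __init__(self, in_ch, out_ch, style_dim=512):
        super().__init__()
        # Component manipulation block (CMB): conv + batch norm + activation (assumed).
        self.cmb = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch, momentum=0.9),
            nn.LeakyReLU(0.2),
        )
        # 1x1 convolution on the (downsampled) input for the residual combination.
        self.skip = nn.Conv2d(in_ch, out_ch, 1, stride=2)
        # Projecting block (PB): global average pooling followed by an FC layer
        # producing the 1024-d distilled vector s^i = [s_a^i, s_b^i].
        self.pb = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(out_ch, 2 * style_dim),
        )

    def forward(self, x):
        feat = self.cmb(x)
        s = self.pb(feat)
        s_a, s_b = s.chunk(2, dim=1)      # scaling and translation halves
        out = feat + self.skip(x)         # residual combination via 1x1 conv
        return out, s_a, s_b

# Example: four EMs with the channel depths [128, 256, 512, 512] reported in Section 4.2,
# starting from the single-channel frequency-refined intensity image:
# ems = [EncoderModule(1, 128), EncoderModule(128, 256),
#        EncoderModule(256, 512), EncoderModule(512, 512)]
```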

Content Fusion. The content fusion module comprises four fully connected (FC) layers followed by a feature transform (FT) layer. This module initiates with a learnable constant vector $\mathbf{p}^{0}\in\mathbb{R}^{512}$, which is initialized by sampling from a normal distribution with zero mean and unit variance, $\mathcal{N}(0,1)$. The initial vector $\mathbf{p}^{0}$ is then processed with the guiding vectors to incorporate their information. Concretely, following each fully connected layer, the activation $\mathbf{p}^{i}$ is transformed into the modified feature embedding $\mathbf{q}^{i}$ through scaling and translation with the content-guiding vector, as follows:

\mathbf{q}^{i}=\mathbf{s}_{a}^{i}\odot\mathbf{p}^{i}+\mathbf{s}_{b}^{i},   (5)

where $\odot$ denotes the Hadamard product, and $i$ denotes the block index ranging from $1$ to $4$. In (5), $\mathbf{s}_{a}^{i}$ controls the direction of the content-guiding vector, and $\mathbf{s}_{b}^{i}$ controls the intensity of the content perturbations.
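The modulation in (5) can be sketched as follows. How the activation $\mathbf{p}^{i}$ is produced from the previous embedding is an assumption here (a plain FC layer applied to $\mathbf{q}^{i-1}$); only the scaling-and-translation step itself follows (5) directly.

```python
import torch
import torch.nn as nn

class ContentFusionModule(nn.Module):
    """Sketch of the CFM; the wiring between FC layers and Eq. (5) is assumed."""
    def __init__(self, dim=512, num_blocks=4):
        super().__init__()
        # Learnable constant vector p^0, initialized from N(0, 1).
        self.p0 = nn.Parameter(torch.randn(dim))
        self.fcs = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_blocks))

    def forward(self, guiding_vectors):
        # guiding_vectors: list of (s_a^i, s_b^i) pairs produced by the four EMs.
        batch = guiding_vectors[0][0].size(0)
        q = self.p0.unsqueeze(0).expand(batch, -1)
        for fc, (s_a, s_b) in zip(self.fcs, guiding_vectors):
            p = fc(q)                 # activation p^i (assumed: FC of the previous q)
            q = s_a * p + s_b         # Eq. (5): direction scaling plus translation
        return q                      # final content-guiding vector q^4
```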

Loss Design. It is well established that the inputs injected into coarse layers play a dominant role in abstracting high-level features, encompassing global structure, viewpoint, large-scale attributes, and object shape. Utilizing this fact, the entire frequency encoding module is trained so that the generated outputs maintain similarities to the characteristics of the guiding image by minimizing the associated loss function. Denoting the output vector of the CFM as $\mathbf{q}$, where $\mathbf{q}=\mathbf{q}^{4}$, the encoding module aims to align the content-guiding vector closely with $\mathbf{w}_{1}$, while allowing perturbations through a randomly generated vector $\mathbf{w}_{2}$:

\mathcal{L}=\frac{1}{N_{b}}\sum\limits^{N_{b}}_{j=1}\lVert\mathbf{q}_{j}-\mathbf{w}_{1,j}\rVert_{2}^{2}.   (6)
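A minimal Phase-II training step implementing (6), assuming the frozen Phase-I generator and mapping network, might look as follows; `frequency_refine` is a hypothetical helper standing in for the grayscale conversion and low-pass refinement described under “Frequency Selection” below, and the generator/mapping interfaces are assumptions.

```python
import torch
import torch.nn.functional as F

def phase2_step(mapping, generator, freq_encoder, optimizer,
                batch_size=32, latent_dim=512, device="cuda"):
    z1 = torch.randn(batch_size, latent_dim, device=device)
    with torch.no_grad():                  # Phase-I networks stay frozen
        w1 = mapping(z1)
        x = generator(w1)                  # reference images with known embedding w1
    # Hypothetical helper: intensity conversion + frequency refinement (see below).
    x_f = frequency_refine(x)
    q = freq_encoder(x_f)                  # final content-guiding vector q = q^4
    loss = F.mse_loss(q, w1)               # Eq. (6)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```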

In the inference process, a real image $\mathbf{y}$ is used as input instead of generated images $\mathbf{x}_{f}$. The guiding vector $\mathbf{q}$ is injected into the coarse blocks of the generator, and the style vector $\mathbf{w}$ is input to the remaining blocks. This approach enables the resulting image to capture attribute features from the input image (e.g., shape, age, and hairstyle in face image generation) while maintaining a similar-looking background or color distribution as the image generated by inputting $\mathbf{w}$ into all layers.

Frequency Selection. There is a large amount of evidence that the frequency component is a crucial factor in interpreting images (Oliva et al., 2006; Petras et al., 2019; Wang et al., 2020; Lee and Hirakawa, 2022; Gueguen et al., 2018; Ehrlich and Davis, 2019; Gal et al., 2021; Yin et al., 2019; Schwarz et al., 2021). These findings have been established through frequency component analysis, as exemplified in Figure 4. Acknowledging the distinct influence of different frequency components on image composition, as observed in the figure, we posit that leveraging specific frequency components can offer advantages in controlling the desired content and the quality of generated images. Subsequently, we conduct an analysis to examine this hypothesis. This involves integrating frequency information into the generation process and studying the impact of different frequency bands. As a result, we anticipate potential enhancement of the generation process that better preserves desired content characteristics and improves the overall quality of the generated images. This utilization of frequency components opens up new possibilities for more effective and precise control over the generated image content and overall image generation process.

To scrutinize our hypothesis, we employ the two-dimensional discrete Fourier transform (DFT), a widely used tool for frequency analysis, to extract frequency features. The DFT converts a finite sequence of samples into a representation that describes its frequency components. Given an intensity (grayscale) image $\widetilde{\mathbf{x}}\left(m,n\right)\in\mathbb{R}^{M\times N}$, converted from $\mathbf{x}$ following the definition of luminance in (Series, 2011), the two-dimensional DFT $\mathbf{X}:\mathbb{R}^{M\times N}\rightarrow\mathbb{C}^{M\times N}$ is defined as follows (Gonzalez, 2009):

\mathbf{X}\left(u,v\right)=\sum_{m=0}^{M-1}\sum_{n=0}^{N-1}\widetilde{x}\left(m,n\right)e^{-j2\pi\left(\frac{um}{M}+\frac{vn}{N}\right)},   (7)

where $u=0,1,\cdots,M-1$ and $v=0,1,\cdots,N-1$ denote equally spaced frequency variables, and $\left(m,n\right)$ denotes a spatial location in $\widetilde{\mathbf{x}}$. To reduce storage and bandwidth requirements, we resize the input intensity image to half its original size, resulting in $\widetilde{\mathbf{x}}\in\mathbb{R}^{128\times 128}$, where $M$ and $N$ are equal to $128$. This resizing operation has been found to have no significant impact on the model’s performance.

Each computed DFT component in (7) corresponds to a specific frequency, determined by its position $\left(u,v\right)$, and the magnitude of each component represents the strength of that frequency in $\widetilde{\mathbf{x}}$. Multiplying the image $\widetilde{\mathbf{x}}\left(m,n\right)$ by $(-1)^{m+n}$ before the transform shifts the direct current (DC) component $\mathbf{X}(0,0)$ to the center $(\lfloor M/2\rfloor,\>\lfloor N/2\rfloor)$ of the frequency spectrum. The shifted DFT is then denoted as $\mathbf{Y}(k,\ell)=\mathbf{X}(u-\lfloor M/2\rfloor,\,v-\lfloor N/2\rfloor)$, where $k$ and $\ell$ represent both positive and negative frequencies, i.e., $k=-\lfloor M/2\rfloor,\cdots,\lceil M/2\rceil-1$ and $\ell=-\lfloor N/2\rfloor,\cdots,\lceil N/2\rceil-1$. In the experiments of this paper, we use this shifted version of the frequency spectrum.
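For reference, the transform in (7) together with the center shift corresponds to a standard FFT call followed by a shift of the zero-frequency component; a minimal NumPy sketch (the placeholder input is only for illustration):

```python
import numpy as np

def centered_spectrum(intensity_image):
    """2-D DFT of a grayscale image with the DC component moved to the center,
    cf. Eq. (7); fftshift is equivalent to the (-1)^(m+n) modulation."""
    X = np.fft.fft2(intensity_image)     # X(u, v), complex-valued
    Y = np.fft.fftshift(X)               # Y(k, l): zero frequency at (M//2, N//2)
    return Y

# Example on a 128x128 intensity image (placeholder data).
x_tilde = np.random.rand(128, 128).astype(np.float32)
Y = centered_spectrum(x_tilde)
magnitude = np.log1p(np.abs(Y))          # log-magnitude, as typically visualized in Fig. 4(a)
```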

To conduct a more rigorous analysis of the influence of frequency components across different bands, the frequency spectrum denoted as $\mathbf{Y}_{f}$ can be adjusted by selectively reducing specific frequency components, where the index $f\in\left\{L,H\right\}$ denotes the low-pass ($L$) and high-pass ($H$) filtering operations, defined as follows:

\mathbf{Y}_{L}\left(u,v\right)=\begin{cases}\mathbf{Y}\left(u,v\right), & \textrm{if}\>\left|u-u_{c}\right|\leq b_{L}\,\land\,\left|v-v_{c}\right|\leq b_{L}\\ 0, & \textrm{otherwise},\end{cases}   (10)

\mathbf{Y}_{H}\left(u,v\right)=\begin{cases}\mathbf{Y}\left(u,v\right), & \textrm{if}\>\left|u-u_{c}\right|\geq b_{H}\,\land\,\left|v-v_{c}\right|\geq b_{H}\\ 0, & \textrm{otherwise},\end{cases}   (13)

where $u_{c}$ and $v_{c}$ are the positions of zero frequency, and the hyper-parameters $b_{L}$ and $b_{H}$ denote the cutoff values for highlighting some range of frequencies.
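The square masks in (10) and (13) translate directly into boolean masks over the centered spectrum. The sketch below implements the conditions exactly as stated:

```python
import numpy as np

def lowpass_mask(M, N, b_L):
    """Square low-pass mask of Eq. (10): keep |u-u_c| <= b_L and |v-v_c| <= b_L."""
    u = np.arange(M)[:, None] - M // 2
    v = np.arange(N)[None, :] - N // 2
    return (np.abs(u) <= b_L) & (np.abs(v) <= b_L)

def highpass_mask(M, N, b_H):
    """Mask of Eq. (13) as stated: keep |u-u_c| >= b_H and |v-v_c| >= b_H."""
    u = np.arange(M)[:, None] - M // 2
    v = np.arange(N)[None, :] - N // 2
    return (np.abs(u) >= b_H) & (np.abs(v) >= b_H)

# Applied to a centered spectrum Y (see the previous sketch); b_L = 10 is the
# cutoff reported for the main experiments in Section 4.3.1.
# Y_L = Y * lowpass_mask(*Y.shape, b_L=10)
```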

Refer to caption
Figure 4: Examples of different composition factors according to different frequency bands. (a) Magnitude of the DFT, with the DC component in the middle. (b) Low-pass ($\mathbf{Y}_{L}$) filtered DFT of (a) with $b_{L}=30$. (c) High-pass ($\mathbf{Y}_{H}$) filtered DFT of (a) with $b_{H}=30$. (d) Inverse DFT of (a), of size $128\times 128$. (e) Inverse DFT of (b); the overall layout of the original image is preserved, while the boundary regions are smoothed. (f) Inverse DFT of (c); this filter highlights the edges in the image while suppressing the homogeneous components.

Once the frequency characteristics are changed by (10) or (13), the frequency-selected image $\mathbf{x}_{f}\left(m,n\right)$ is acquired by the inverse DFT, $\mathbf{X}^{-1}:\mathbb{C}^{M\times N}\rightarrow\mathbb{R}^{M\times N}$, to preserve both spatial information and refined frequency information:

\mathbf{x}_{f}\left(m,n\right)=\frac{1}{MN}\sum_{u=0}^{M-1}\sum_{v=0}^{N-1}\mathbf{Y}_{f}\left(u,v\right)e^{j2\pi\left(\frac{mu}{M}+\frac{nv}{N}\right)},   (14)

for $m=0,1,\cdots,M-1$ and $n=0,1,\cdots,N-1$.
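Equation (14) is the standard inverse transform; applied to the filtered, centered spectrum it reduces to undoing the shift and calling the inverse FFT:

```python
import numpy as np

def frequency_refined_image(Y_f):
    """Inverse DFT of Eq. (14): map the filtered, centered spectrum back to the
    spatial domain to obtain the frequency-refined image x_f."""
    X_f = np.fft.ifftshift(Y_f)            # undo the center shift
    x_f = np.fft.ifft2(X_f).real           # imaginary residue is numerical noise
    return x_f.astype(np.float32)
```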

The impacts of frequency selection on image composition by (10) and (13) are visually demonstrated in Figure 4. The figure depicts how different frequency bands emphasize specific characteristics when composing images, providing compelling evidence of the significant impact of frequency components on image composition. In particular, the inverse DFT image in Figure 4(e), obtained with the low-pass filter of (10), preserves low-frequency components and smooths some details compared to the original image. On the other hand, the inverse DFT image in Figure 4(f), obtained with the high-pass filter of (13), highlights edges while suppressing homogeneous components in the original image. Based on these observations and theoretical reasoning (Strang and Nguyen, 1996; Gonzalez, 2009; Lee et al., 2018), we can prioritize low-frequency components that are directly related to the content of images, while suppressing high-frequency components that are mainly related to appearance. This approach allows us to focus on and transfer only the content of the image, providing a strong rationale for using frequency analysis in image generation. By emphasizing content and minimizing stylistic influence, frequency analysis and filtering improve content transfer and reduce stylistic interference. These advantages are further supported by the experimental results presented in Section 4.

4 Results and Discussion

This section presents experimental results based on the proposed method described in Section 3. An overview of the training dataset is provided in Section 4.1, followed by a discussion of the implementation details in Section 4.2. The effectiveness of the proposed framework is evaluated and demonstrated through the experimental results presented in Section 4.3. Finally, Section 4.4 provides an in-depth analysis, exploring the effects of frequency analysis on image generation and performance comparison metrics.

4.1 Datasets

The effectiveness of the proposed framework was evaluated using four benchmark datasets commonly used in image generation tasks: FFHQ (Karras et al., 2019), AFHQ Cat and Dog (Choi et al., 2020), and LSUN Church and Bedroom (Yu et al., 2015). Additionally, the CelebFaces Attributes
(CelebA) (Liu et al., 2015) dataset was utilized to train binary classification models and demonstrate the efficiency of the proposed approach.

FFHQ comprises 70,000 high-quality images at a resolution of $1024\times 1024$, making it one of the largest datasets of its kind. It is widely used for its diverse collection of facial features, including variations in gender, age, and geographic origin. The dataset also includes a wide range of accessories such as glasses, necklaces, hats, earrings, and cosplay costumes. The FFHQ dataset also offers a variety of backgrounds, further enhancing the dataset’s richness and making it a good choice for researchers aiming to train machine-learning models that can recognize and synthesize facial features in a realistic manner.

AFHQ is a comprehensive and high-quality collection of animal face images. It contains a diverse range of breeds, ages, poses, and accessories. The images in the AFHQ dataset are of high resolution, with each image having dimensions of $512\times 512$ pixels. The dataset consists of three domains: cat, dog, and wildlife. Each domain contains nearly 5,000 images. However, for the study of this paper, the cat and dog domains were used. To ensure consistency and quality, the images in the AFHQ dataset were carefully processed, including vertical and horizontal alignment to center the eyes. This pre-processing helps to ensure uniformity in the images and facilitates accurate analysis and modeling.

LSUN is a large-scale collection of scene images consisting of 10 categories, including dining rooms, bedrooms, churches, and classrooms. It provides a substantial amount of training data, with each category containing approximately 120,000 to 3,000,000 images. Additionally, there is a small validation set consisting of around 300 images and a test set of 1,000 images for each category. For evaluation purposes, we focused on the bedroom and outdoor church categories. To train the GANs, we combined the train, validation, and test sets of these categories. The outdoor church dataset contains 126,227 images, which is a reasonable size for training GANs. However, the bedroom dataset is much larger, containing over 3 million images. To create a more manageable dataset, we randomly selected a subset of 100,000 images from the original set. By selecting a representative subset of images, we aimed to strike a balance between dataset size and computational feasibility, ensuring that the training process remained efficient and effective.

CelebA is a widely used dataset that consists of a comprehensive collection of 202,599 face images belonging to 10,177 different celebrities. Each image in the dataset is annotated with 40 binary labels, providing information about various facial attributes such as hair color, gender, age, and more. This extensive annotation makes the CelebA dataset valuable for training and evaluating algorithms related to face recognition, facial attribute prediction, and face synthesis. In the experiments of this paper, the CelebA dataset was specifically used to train the face attribute prediction model. By leveraging the vast amount of labeled facial images in the dataset, we aimed to develop a robust and accurate method for evaluating content preservation.

4.2 Implementation Details

The proposed framework was implemented using PyTorch 1.13.1, CUDA Toolkit 11.7, and cuDNN 8.8.0 on a single NVIDIA GeForce RTX 3090 GPU with 24 GB of VRAM.

During the training of the generating module in Section 3.2.1, we followed the training strategy outlined in StyleGAN2 (Karras et al., 2020b) without data augmentation. To ensure stable training, we applied the equalized learning rate technique (Karras et al., 2019) to all trainable parameters. A leaky ReLU activation (Agarap, 2018) with a negative slope of 0.2 was used, and a minibatch standard deviation layer was included at the end of the discriminator (Karras et al., 2019). For optimization, we employed the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.0002, $\beta_{1}=0$, and $\beta_{2}=0.99$. The minibatch size was set to 32, and an exponential moving average (EMA) was maintained over the generator and mapping network parameters, updated once every 10 generator iterations.
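The EMA bookkeeping mentioned above can be sketched as below; the decay value is an assumption, since only the update interval (once every 10 generator iterations) is specified.

```python
import copy
import torch

@torch.no_grad()
def update_ema(ema_model, model, decay=0.999):
    """Exponential moving average of parameters; the decay value is an assumption."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.lerp_(p, 1.0 - decay)   # p_ema <- decay * p_ema + (1 - decay) * p

# Usage (sketch): keep a deep copy of the generator and refresh it periodically.
# ema_generator = copy.deepcopy(generator)
# if iteration % 10 == 0:            # once every 10 generator iterations (Sec. 4.2)
#     update_ema(ema_generator, generator)
```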

The frequency encoding module described in Section 3.2.2 was trained using the pre-trained models $E_{\psi}$ and $G_{\phi}$ from the generating module, with $\psi$ and $\phi$ frozen. During the training of this module, we applied batch normalization with a momentum value of 0.9. Each model was trained for a total of 50 epochs, using a batch size of 32. The Adam optimizer was utilized with parameters $\beta_{1}=0.9$, $\beta_{2}=0.999$, and a learning rate of 0.0001. The depth of each encoder module progressively increased in the following order: $\left[128,256,512,512\right]$.

4.3 Experimental Results

In order to validate the effectiveness of the proposed framework, we conducted an assessment from two perspectives: 1) distribution fidelity, which measures the similarity between the distribution of generated images and real images, and 2) content fidelity, which quantifies how much the generated images maintain the contents of the guided images.

4.3.1 Distribution Fidelity

To evaluate the similarity of the generated image distribution to that of real images, the performance of the proposed framework was compared with that of the current state-of-the-art GAN model, StyleGAN2 (Karras et al., 2020b), as well as with diffusion models including ADM (Dhariwal and Nichol, 2021) and Diffusion-GAN (Wang et al., 2023b). Note that since StyleGAN2 does not provide evaluation results under the same configuration as the proposed framework, we re-trained StyleGAN2 using the configuration detailed in Section 4.2 to ensure a fair comparison. For this purpose, we utilized the official PyTorch implementation of StyleGAN2-ADA (Karras et al., 2020a), developed by the same authors as StyleGAN2 (Karras et al., 2020b), which has been confirmed to produce results consistent with the TensorFlow version.

The performance of models was primarily evaluated using Precision (P) (Kynkäänniemi et al., 2019) on a set of 50,000 generated images for each dataset. Precision measures the percentage of generated images that accurately match specific real images in the dataset, indicating how realistic the generated images are. Therefore, Precision is the most suitable metric for assessing the content similarity between generated and reference images, aligning with the objectives of this work. Nevertheless, Fréchet Inception Distance (FID) (Heusel et al., 2017) scores were also provided, as it is a commonly used metric in image generation. FID focuses more on quantifying the diversity of the generated images compared to the desired distribution (Kynkäänniemi et al., 2019).
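For clarity, the improved Precision metric of Kynkäänniemi et al. (2019) can be read as a k-nearest-neighbor coverage test in a feature space; the sketch below assumes pre-extracted feature vectors, uses k = 3 as in the original metric, and is only practical for modest sample sizes because of the dense distance matrices.

```python
import torch

def precision_at_k(real_feats, fake_feats, k=3):
    # Manifold radius of each real sample: distance to its k-th nearest real neighbor
    # (the +1 skips the zero distance of the sample to itself).
    d_real = torch.cdist(real_feats, real_feats)
    radii = d_real.kthvalue(k + 1, dim=1).values

    # A generated sample counts as "precise" if it lies inside the ball of
    # at least one real sample.
    d_fake_to_real = torch.cdist(fake_feats, real_feats)
    covered = (d_fake_to_real <= radii.unsqueeze(0)).any(dim=1)
    return covered.float().mean().item()
```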

Table 1: Performance comparisons on FFHQ (Karras et al., 2019), AFHQ Cat and Dog (Choi et al., 2020), and LSUN Bedroom and Church  (Yu et al., 2015) in terms of Fréchet Inception Distance (FID) and Precision (P). Precision measures the percentage of generated images that accurately match specific real-world images in the dataset, offering insight into the realism of the generated images. Therefore, higher Precision values imply that a model can produce images preserving the content of reference images.
Dataset       | StyleGAN2       | ADM             | Diffusion-GAN   | Proposed
              | P ↑     FID ↓   | P ↑     FID ↓   | P ↑     FID ↓   | P ↑     FID ↓
FFHQ          | 0.694   4.26    | 0.708   22.18   | 0.689   2.83    | 0.812   15.2
AFHQ Cat      | 0.710   5.29    | 0.788   12.65   | 0.612   2.4     | 0.823   12.52
AFHQ Dog      | 0.653   29.92   | 0.822   20.42   | 0.600   4.83    | 0.814   30.84
LSUN Bedroom  | 0.547   3.65    | 0.660   1.9     | 0.606   3.65    | 0.690   14.34
LSUN Church   | 0.573   3.35    | 0.496   21.09   | 0.573   3.17    | 0.774   10.72

The evaluation results are provided in Table 1. It is important to note that in the experiments conducted in this section, a low-pass filter with a cutoff value of $b_{L}=10$ was applied to all test datasets. The filter type and cutoff value were selected based on the experiments described in Section 4.4.3, where they provided the best Precision for the task. According to the results in Table 1, the Precision of the proposed framework showed significant improvements compared to StyleGAN2: 17% on FFHQ, 16% on AFHQ Cat, 25% on AFHQ Dog, 20% on LSUN Bedroom, and 35% on LSUN Church. Additionally, performance improved by approximately 15% on FFHQ, 4% on AFHQ Cat, 5% on LSUN Bedroom, and 56% on LSUN Church compared to ADM. Compared to Diffusion-GAN, performance improved by approximately 18% on FFHQ, 34% on AFHQ Cat, 36% on AFHQ Dog, 14% on LSUN Bedroom, and 36% on LSUN Church. In contrast, the FID scores of the proposed framework were higher than those of the comparison methods. This discrepancy arises because the proposed method emphasizes preserving the attributes of reference images, resulting in generated images that closely resemble specific images in the real dataset. As a consequence, the proposed framework might not fully capture the overall statistical properties of the real dataset. This highlights the trade-off between similarity and diversity in the training of GANs. Further analysis of the tradeoff relationship between Precision and FID in Section 4.4.2 suggests that as Precision increases, FID inevitably increases. Therefore, higher Precision values confirm that the proposed framework can produce images preserving the content attributes of reference images, even as FID increases, aligning with our intended goal.

Refer to caption
(a)
(b)

Figure 5: Example of (a) generated images by the proposed framework trained on FFHQ (Karras et al., 2019) and (b) their corresponding attributes determined by the classifiers described in Section 4.3.2. In (a), the images in the first column are real input images, and the images in the first row are the images generated from a projected vector $\mathbf{w}_{2}$. These examples demonstrate the preservation of real image content in the row-wise direction and control over color distribution and background in the column-wise direction for generated images. The table in (b) displays examples of the classified attributes. Each cell in the table contains the abbreviations of attributes of the images corresponding to the positions in (a).
Refer to caption
Cat
Refer to caption
Dog
Figure 6: Example of generated images by the proposed framework trained on AFHQ (Choi et al., 2020). The images in the first column are real input images, and the images in the first row are the images generated from a projected vector $\mathbf{w}_{2}$ in Figure 3.
Refer to caption
Bedroom
Refer to caption
Church
Figure 7: Example of generated images by the proposed framework trained on LSUN (Yu et al., 2015). The images in the first column are real input images, and the images in the first row are the images generated from a projected vector $\mathbf{w}_{2}$.

We also visually demonstrate the effectiveness of the proposed framework in Figures 5, 6, and 7. In these figures, the real input images are displayed in the first column, and the images generated from a projected vector $\mathbf{w}_{2}$ in Figure 3 are shown in the first row. The examples clearly demonstrate that the generated images preserve the content of the corresponding real input images within the same row. In particular, the generated face images in Figure 5 exhibit similar attributes such as age, face shape, pose, direction, hairstyle, and accessories, while other attributes such as skin color, hair color, and ethnicity match those of the images in the first row. This observation indicates that the color distribution and backgrounds of the generated images were influenced by the vector $\mathbf{w}_{2}$ injected into the fine layers of $G_{\phi}$, as expected.

As validated by the experiments, the proposed frequency encoding module effectively produces output images that resemble the content of the reference images, thereby providing users with control over the content of the generated images.

4.3.2 Content Fidelity

As specific evaluation metrics to quantify the transfer of content from real images to generated images are not available, we assessed the effectiveness of the proposed framework by examining its ability to preserve attributes of guiding images. However, since attribute labels for other benchmark datasets were unavailable, we employed the CelebA dataset to train attribute classifiers and applied these classifiers to the images of FFHQ. The CelebA dataset includes 40 facial attribute annotations, covering aspects such as gender, hairstyle, and accessories, as detailed in Section 4.1.

To conduct the experiment, we trained a binary classifier for each attribute. These classifiers were then applied to both the guiding images and the generated images to determine the percentage of attributes that matched between them. The binary classifiers were implemented using EfficientNet-B4 (Tan and Le, 2019) and trained with the Adam optimizer. We set the optimizer’s parameters to $\beta_{1}=0.9$, $\beta_{2}=0.999$, and a weight decay of 0.01. The training process lasted for 200 epochs, using a batch size of 512. To ensure a smooth learning rate transition, we employed a warm-up strategy, gradually increasing the learning rate from 0 to 0.00001 (Goyal et al., 2017) during the first epoch. The training images were obtained by cropping to 90% of their original size. Data augmentation techniques, such as horizontal flipping, random brightness and contrast adjustments, and cutout (DeVries and Taylor, 2017) with four holes (each hole having a maximum edge length of 32), were also applied.

The attribute classification results for the output images of Figure 5(a) are exemplified in Figure 5(b). Each cell of Figure 5(b) contains abbreviations representing the attributes detected at the corresponding positions in Figure 5(a). The attributes include Attractive (A), Bald (B), Bushy Eyebrows (BE), Black Hair (BH), Big Lips (BL), Big Nose (BN), Brown Hair (BrH), Bags Under Eyes (BUE), Chubby (C), Double Chin (DC), Eyeglasses (E), High Cheekbones (HC), Heavy Makeup (HM), Male (M), Mouth Slightly Open (MSO), No Beard (NB), Narrow Eyes (NE), Pale Skin (PS), Smiling (S), Sideburns (Sb), Straight Hair (SH), Wearing Lipstick (WL), Wavy Hair (WaH), Wearing Hat (WH), and Young (Y). As expected, the attributes of the output images aligned with those of the images in the same row. For instance, the hairstyle, gender, and glasses of the real images were transferred to the output images, while the background and color characteristics of the images in the first row were preserved in the outputs. The average matching accuracy across all attributes exceeds 85%. This demonstrates the effectiveness of the proposed framework in controlling the content of the generated images while retaining other attributes influenced by the injected random vector.
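The matching measure itself reduces to counting agreements between the classifiers' predictions on the guiding and generated images; a small sketch is given below, where the (N, 40) boolean prediction arrays are assumed to come from classifiers such as the one sketched earlier.

```python
import numpy as np

def attribute_match_rate(guide_attrs: np.ndarray, gen_attrs: np.ndarray) -> float:
    """Fraction of binary attribute predictions that agree between each
    guiding image and its generated counterpart (arrays of shape (N, 40))."""
    assert guide_attrs.shape == gen_attrs.shape
    return float((guide_attrs == gen_attrs).mean())
```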

4.4 Extensive Analysis

In addition to the main experiments, we conducted in-depth analyses focusing on key aspects: the impact of the number of EM blocks on performance, the tradeoff between FID and precision, and the influence of frequency bands on image generation.

4.4.1 Impact of EM blocks

To determine the structure of the CFM, we assessed the impact of varying the number of EM blocks on its performance, as shown in Figure 8. Configurations with more than 4 EM blocks could decrease the FID score, but they also incurred a considerable increase in model complexity while offering only marginal performance improvements over the 4-block structure. Consequently, we configured the CFM with 4 EM blocks, as this choice balances satisfactory performance with manageable model complexity.

Figure 8: FID and model complexity over various numbers of EM blocks.

4.4.2 Tradeoff between FID and Precision

The truncation trick (Brock et al., 2019; Karras et al., 2019; Marchesi, 2017) refers to restricting the range of the random noise input to enhance the similarity between generated and real images at the expense of diversity. This tradeoff between similarity and diversity is commonly observed in GAN training. When truncation is maximized, i.e., $\psi=0$, the variation in the generated images decreases due to the reduced influence of random noise. Conversely, as the truncation threshold $\psi$ increases, diversity increases and the overall distribution of generated images becomes more similar to that of real images, while the precision of individual samples decreases (Kynkäänniemi et al., 2019). This is clearly demonstrated by the analysis results in Figure 9. In the figure, we plot FID and (1-Precision) instead of Precision to clearly illustrate the tradeoff between them; this choice is made because lower values indicate better performance for FID, whereas higher values indicate better performance for Precision.
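For reference, the truncation trick amounts to interpolating latent codes toward the average code; a minimal sketch in a StyleGAN-style $\mathcal{W}$ space is shown below, where `w_avg` is assumed to be the running mean of mapped latents and is not defined in this paper.

```python
import torch

def truncate_w(w: torch.Tensor, w_avg: torch.Tensor, psi: float) -> torch.Tensor:
    """Truncation trick in W space: psi = 0 collapses every code onto w_avg
    (maximal truncation, low diversity, high precision), while psi = 1 leaves
    the codes untouched (full diversity)."""
    return w_avg + psi * (w - w_avg)
```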

The observed tradeoff can be directly applied to interpret the performance of the proposed framework, reported in Section 4.3.1, in the context of its objective. The proposed framework aims to generate output images that closely resemble user-selected, specific real-world images. As a result, the precision of the generated images is significantly improved, indicating that a larger fraction of the generated samples lies close to the real image manifold. However, this emphasis on content preservation may come at the expense of overall statistical diversity, leading to higher FID scores compared to other methods. This intentional tradeoff aligns with the primary objective of the framework, which is to afford users control over the content of the generated images while maintaining an acceptable level of quality, as observed in the examples in Figures 5, 6, and 7. This interpretation is supported by the experimental results presented in Table 1 and Figure 10.
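Since Precision here follows the improved metric of Kynkäänniemi et al. (2019), a rough sketch of its computation is included below under simplifying assumptions: features are precomputed embeddings, and the real manifold is approximated by k-nearest-neighbor hyperspheres; the function name is a placeholder.

```python
import numpy as np

def precision(real_feats: np.ndarray, fake_feats: np.ndarray, k: int = 3) -> float:
    """Simplified sketch of improved precision (Kynkäänniemi et al., 2019):
    a generated sample counts as precise if it falls inside at least one
    hypersphere centered on a real feature, whose radius is that feature's
    distance to its k-th nearest real neighbor."""
    d_real = np.linalg.norm(real_feats[:, None] - real_feats[None], axis=-1)
    radii = np.sort(d_real, axis=1)[:, k]          # index 0 is the point itself
    d_fake_real = np.linalg.norm(fake_feats[:, None] - real_feats[None], axis=-1)
    return float((d_fake_real <= radii[None]).any(axis=1).mean())
```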

FFHQ (left) and LSUN Church (right)
Figure 9: Tradeoff between FID and (1-Precision) over different truncation values, starting from the maximally truncated setup ($\psi=0$). The blue line represents the FID values normalized to the range 0 to 1, while the orange line represents the error (1-Precision). Note that truncation balances perceptual quality and variation: maximal truncation results in low diversity but high precision, and as truncation is reduced, diversity increases while precision decreases.
Figure 10: Performance comparison over different cutoff frequencies in terms of FID and Precision. (a) FID of filtered images relative to the images without filtering. (b) Precision for different cutoff values. The cutoff values of low-pass filtered images are denoted by $b_{L}$, and those of high-pass filtered images by $b_{H}$. The values next to the dataset names represent the FID and Precision of the unfiltered images.

4.4.3 Frequency Analysis

To analyze the influence of frequency components on image generation, we generated images using different cutoff frequencies ($b_{L}$ and $b_{H}$) with (10) or (13), along with (7) and (14). Applying these filters to input images with different frequency limits enables us to examine the correlation between frequency manipulation and the resulting generated images.
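Since equations (10) and (13) are not reproduced in this excerpt, the sketch below only illustrates the kind of ideal low-/high-pass filtering with cutoff radii $b_{L}$ and $b_{H}$ that such a setup implies; the function name and the circular-mask formulation are assumptions rather than the paper's exact definitions.

```python
import numpy as np

def frequency_filter(image: np.ndarray, cutoff: float, mode: str = "low") -> np.ndarray:
    """Ideal circular low-/high-pass filter applied per channel via the 2-D FFT.
    `cutoff` plays the role of b_L (radius kept) for mode="low" or b_H (radius
    removed) for mode="high"; `image` is an (H, W, C) float array."""
    h, w = image.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)  # distance from spectrum center
    mask = dist <= cutoff if mode == "low" else dist > cutoff

    out = np.empty_like(image, dtype=np.float32)
    for c in range(image.shape[2]):
        spectrum = np.fft.fftshift(np.fft.fft2(image[..., c]))
        out[..., c] = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))
    return out

# e.g., keep a narrow low-frequency band or only the high-frequency residual:
# low_band  = frequency_filter(img, cutoff=30,  mode="low")   # b_L = 30
# high_band = frequency_filter(img, cutoff=110, mode="high")  # b_H = 110
```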

Figure 10(a) shows the ratio of the FID obtained with different bandwidths of low-pass filtering using (10) and high-pass filtering using (13) to the FID of images without filtering. The values next to the dataset names give the FID of the unfiltered images for reference. Overall, encoding frequency-selected images into feature embeddings tended to improve FID scores. Interestingly, the FID scores improved as more high-frequency components were removed through low-pass filtering. On the other hand, retaining only high-frequency components did not significantly improve FID scores, although it still yielded better results than full-band images; likewise, increasing the bandwidth of high-pass filtered images led to better FID scores. Based on these observations, we conclude that restricting certain frequency bands improves the quality of generated images. In particular, maintaining a narrow band of low-frequency components tends to produce higher-quality images. This could be attributed to the characteristics of neural networks, which may face challenges in learning high-frequency components during training.

Figure 11: Average accuracy of the 40 attribute classifiers for various cutoff values and classification thresholds.

As pointed out in Section 4.3, the tradeoff between FID and Precision can also be observed in the graphs of Figure 10(a) and (b). This relationship was particularly evident for the models using low-pass filtered images, as depicted in Figure 10(b): as Precision increased, the FID score also increased, indicating a decline in similarity to the overall distribution of real data. Although the tradeoff was less clear for high-pass filtered images, there were instances where higher Precision corresponded to higher FID scores.

FFHQ, AFHQ Cat, AFHQ Dog, LSUN Church, LSUN Bedroom
Figure 12: Examples of generated images with different frequency bands. The first row displays the real-world images. For each dataset, the frequency-filtered versions with different bandwidths are shown on the left and the corresponding output images on the right. The second row shows results using full-band images, the third row low-pass filtered images with $b_{L}=30$, and the fourth row high-pass filtered images with $b_{H}=110$. For better visualization, high-pass components larger than the average of the high-pass images are emphasized by multiplying by a scalar while suppressing the other values.

In addition, Figure 11 displays the attribute classification accuracy on the FFHQ dataset for various cutoff values and classification thresholds (0.3, 0.5, 0.7). To examine the influence of frequency bands on content preservation, we conducted experiments similar to those described in Section 4.3.2, but using different frequency-selected images. The attribute classification results indicate that wider bandwidths of low-pass filters tended to preserve the content characteristics to a slightly greater extent. This observation is consistent with the analysis of Precision, where increasing precision with wider low-pass bandwidth allowed for the preservation of more attributes. Conversely, narrow bandwidth high-pass filters showed a slight advantage in transferring content attributes. While the performance differences may not be significant, the experimental results suggest that manipulating low-frequency components, which contain more abundant spatial information, can assist in controlling the content of generated images. This provides valuable insight into content manipulation techniques.

We also visualize how the output images look with different cutoff frequencies in Figure 12. These examples provide qualitative evidence that retaining low-frequency components was more effective in aligning the content with the guiding images than keeping high-frequency components. This reaffirms that preserving only low-frequency components can enhance image quality, as indicated by the FID scores, and yield higher Precision, although the impact on Precision is minimal, as observed in the graphs of Figure 10.

5 Conclusion

In this paper, we introduced a novel encoder architecture designed to generate images that retain the content of real-world images, offering users control over image generation. To assess the effectiveness of the proposed framework in controlling image content, we conducted attribute classification and compared the characteristics of real-world and generated images. The experimental results demonstrated that the proposed framework successfully transferred content from real-world images to the generated ones, preserving an average of 85% of the input attributes from real face images in the synthetic images of the FFHQ dataset. Furthermore, we investigated the impact of frequency bands on image generation. The associated findings revealed that low-frequency components had a more favorable effect on the FID score, while high-frequency components introduced noise and increased the FID score. We believe this study will inspire further advancements in desired content generation for image generation applications.

References

  • Afifi et al. (2021) Afifi, M., Brubaker, M.A., Brown, M.S., 2021. HistoGAN: Controlling colors of gan-generated and real images via color histograms, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 7941–7950.
  • Agarap (2018) Agarap, A.F., 2018. Deep learning using rectified linear units (relu). arXiv preprint arXiv:1803.08375 .
  • Bai et al. (2022) Bai, J., Yuan, L., Xia, S.T., Yan, S., Li, Z., Liu, W., 2022. Improving vision transformers by revisiting high-frequency components, in: Eur. Conf. Comput. Vis., pp. 1–18.
  • Balakrishnan et al. (2022) Balakrishnan, G., Gadde, R., Martinez, A., Perona, P., 2022. Rayleigh eigendirections (reds): Nonlinear gan latent space traversals for multidimensional features, in: Eur. Conf. Comput. Vis, pp. 510–526.
  • Brock et al. (2021) Brock, A., De, S., Smith, S.L., Simonyan, K., 2021. High-performance large-scale image recognition without normalization, in: Int. Conf. Mach. Learn., pp. 1059–1071.
  • Brock et al. (2019) Brock, A., Donahue, J., Simonyan, K., 2019. Large scale gan training for high fidelity natural image synthesis, in: Int. Conf. Learn. Represent.
  • Chen et al. (2021) Chen, Y., Li, G., Jin, C., Liu, S., Li, T., 2021. SSD-GAN: Measuring the realness in the spatial and spectral domains, in: AAAI Conf. Artif. Intell., pp. 1105–1112.
  • Choi et al. (2018) Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J., 2018. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 8789–8797. doi:10.1109/CVPR.2018.00916.
  • Choi et al. (2020) Choi, Y., Uh, Y., Yoo, J., Ha, J.W., 2020. Stargan v2: Diverse image synthesis for multiple domains, in: IEEE Conf. Comput. Vis. Pattern Recog.
  • Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. ImageNet: A large-scale hierarchical image database, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 248–255.
  • DeVries and Taylor (2017) DeVries, T., Taylor, G.W., 2017. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 .
  • Dhariwal and Nichol (2021) Dhariwal, P., Nichol, A., 2021. Diffusion models beat gans on image synthesis, in: Adv. Neural Inform. Process. Syst., pp. 8780–8794.
  • Dosovitskiy et al. (2021) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al., 2021. An image is worth 16x16 words: Transformers for image recognition at scale.
  • Durall et al. (2020) Durall, R., Keuper, M., Keuper, J., 2020. Watch your up-convolution: Cnn based generative deep neural networks are failing to reproduce spectral distributions, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 7890–7899.
  • Dzanic et al. (2020) Dzanic, T., Shah, K., Witherden, F., 2020. Fourier spectrum discrepancies in deep network generated images. Adv. Neural Inform. Process. Syst. 33, 3022–3032.
  • Ehrlich and Davis (2019) Ehrlich, M., Davis, L.S., 2019. Deep residual learning in the jpeg transform domain, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 3484–3493.
  • Frank et al. (2020) Frank, J., Eisenhofer, T., Schönherr, L., Fischer, A., Kolossa, D., Holz, T., 2020. Leveraging frequency analysis for deep fake image recognition, in: Int. Conf. Mach. Learn., pp. 3247–3258.
  • Fuoli et al. (2021) Fuoli, D., Van Gool, L., Timofte, R., 2021. Fourier space losses for efficient perceptual image super-resolution, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 2360–2369.
  • Gal et al. (2021) Gal, R., Hochberg, D.C., Bermano, A., Cohen-Or, D., 2021. SWAGAN: A style-based wavelet-driven generative model. ACM Trans. Graph. 40, 1–11.
  • Gonzalez (2009) Gonzalez, R.C., 2009. Digital Image Processing. Pearson.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Adv. Neural Inform. Process. Syst.
  • Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K., 2017. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677 .
  • Gueguen et al. (2018) Gueguen, L., Sergeev, A., Kadlec, B., Liu, R., Yosinski, J., 2018. Faster neural networks straight from jpeg. Adv. Neural Inform. Process. Syst. 31.
  • Han et al. (2024a) Han, C., Liang, J.C., Wang, Q., Rabbani, M., Dianat, S., Rao, R., Wu, Y.N., Liu, D., 2024a. Image translation as diffusion visual programmers. arXiv preprint arXiv:2401.09742 .
  • Han et al. (2024b) Han, C., Lu, Y., Sun, G., Liang, J.C., Cao, Z., Wang, Q., Guan, Q., Dianat, S.A., Rao, R.M., Geng, T., et al., 2024b. Prototypical transformer as unified motion learners.
  • He et al. (2017) He, K., Gkioxari, G., Dollár, P., Girshick, R., 2017. Mask r-cnn, in: Int. Conf. Comput. Vis., pp. 2961–2969.
  • He et al. (2016) He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 770–778.
  • Hertz et al. (2022) Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D., 2022. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 .
  • Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S., 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Adv. Neural Inform. Process. Syst. 30.
  • Ho et al. (2020) Ho, J., Jain, A., Abbeel, P., 2020. Denoising diffusion probabilistic models, in: Adv. Neural Inform. Process. Syst., pp. 6840–6851.
  • Ho and Salimans (2022) Ho, J., Salimans, T., 2022. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 .
  • Hu et al. (2018) Hu, J., Shen, L., Sun, G., 2018. Squeeze-and-excitation networks, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 7132–7141.
  • Ioffe and Szegedy (2015) Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: Int. Conf. Mach. Learn., pp. 448–456.
  • Karnewar and Wang (2020) Karnewar, A., Wang, O., 2020. MSG-GAN: Multi-scale gradients for generative adversarial networks, in: IEEE Conf. Comput. Vis. Pattern Recog.
  • Karras et al. (2020a) Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., Aila, T., 2020a. Training generative adversarial networks with limited data, in: Adv. Neural Inform. Process. Syst.
  • Karras et al. (2021) Karras, T., Aittala, M., Laine, S., Härkönen, E., Hellsten, J., Lehtinen, J., Aila, T., 2021. Alias-free generative adversarial networks. Adv. Neural Inform. Process. Syst. 34, 852–863.
  • Karras et al. (2019) Karras, T., Laine, S., Aila, T., 2019. A style-based generator architecture for generative adversarial networks, in: IEEE Conf. Comput. Vis. Pattern Recog.
  • Karras et al. (2020b) Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T., 2020b. Analyzing and improving the image quality of stylegan, in: IEEE Conf. Comput. Vis. Pattern Recog.
  • Khayatkhoei and Elgammal (2022) Khayatkhoei, M., Elgammal, A., 2022. Spatial frequency bias in convolutional generative adversarial networks, in: AAAI Conf. Artif. Intell., pp. 7152–7159.
  • Kingma and Ba (2015) Kingma, D.P., Ba, J., 2015. Adam: A method for stochastic optimization, in: Int. Conf. Learn. Represent.
  • Krizhevsky et al. (2009) Krizhevsky, A., Hinton, G., et al., 2009. Learning multiple layers of features from tiny images.
  • Kynkäänniemi et al. (2019) Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., Aila, T., 2019. Improved precision and recall metric for assessing generative models. Adv. Neural Inform. Process. Syst. 32.
  • Lee et al. (2022) Lee, K., Chang, H., Jiang, L., Zhang, H., Tu, Z., Liu, C., 2022. VITGAN: Training gans with vision transformers.
  • Lee and Hirakawa (2022) Lee, Y., Hirakawa, K., 2022. Lossless white balance for improved lossless cfa image and video compression. IEEE Trans. Image Process. 31, 3309–3321.
  • Lee et al. (2018) Lee, Y., Hirakawa, K., Nguyen, T.Q., 2018. Camera-aware multi-resolution analysis for raw image sensor data compression. IEEE Transactions on Image Processing 27, 2806–2817.
  • Li et al. (2020) Li, Y., Singh, K.K., Ojha, U., Lee, Y.J., 2020. Mixnmatch: Multifactor disentanglement and encoding for conditional image generation, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 8039–8048.
  • Li et al. (2023) Li, Z., Xia, P., Rui, X., Li, B., 2023. Exploring the effect of high-frequency components in gans training. ACM Trans. Multimedia Comput., Commun. Appl. 19, 1–22.
  • Lin et al. (2014) Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft COCO: Common objects in context, in: Eur. Conf. Comput. Vis., pp. 740–755.
  • Liu et al. (2021) Liu, D., Cui, Y., Tan, W., Chen, Y., 2021. Sg-net: Spatial granularity network for one-stage video instance segmentation, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 9816–9825.
  • Liu et al. (2015) Liu, Z., Luo, P., Wang, X., Tang, X., 2015. Deep learning face attributes in the wild, in: Int. Conf. Comput. Vis.
  • Lu et al. (2023) Lu, Y., Wang, Q., Ma, S., Geng, T., Chen, Y.V., Chen, H., Liu, D., 2023. Transflow: Transformer as flow learner, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 18063–18073.
  • Mahendren et al. (2023) Mahendren, S., Edussooriya, C.U., Rodrigo, R., 2023. Diverse single image generation with controllable global structure. Neurocomputing 528, 97–112.
  • Marchesi (2017) Marchesi, M., 2017. Megapixel size image creation using generative adversarial networks. arXiv preprint arXiv:1706.00082 .
  • Mescheder et al. (2018) Mescheder, L., Geiger, A., Nowozin, S., 2018. Which training methods for GANs do actually converge?, in: Int. Conf. Mach. Learn., pp. 3481–3490.
  • Mirza and Osindero (2014) Mirza, M., Osindero, S., 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 .
  • Odena et al. (2017) Odena, A., Olah, C., Shlens, J., 2017. Conditional image synthesis with auxiliary classifier gans, in: Int. Conf. Mach. Learn., pp. 2642–2651.
  • Oliva et al. (2006) Oliva, A., Torralba, A., Schyns, P.G., 2006. Hybrid images. ACM Trans. Graph. 25, 527–532.
  • Petras et al. (2019) Petras, K., Ten Oever, S., Jacobs, C., Goffaux, V., 2019. Coarse-to-fine information integration in human vision. NeuroImage 186, 103–112.
  • Redmon and Farhadi (2018) Redmon, J., Farhadi, A., 2018. Yolov3: An incremental improvement.
  • Rombach et al. (2022) Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B., 2022. High-resolution image synthesis with latent diffusion models, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 10684–10695.
  • Schwarz et al. (2021) Schwarz, K., Liao, Y., Geiger, A., 2021. On the frequency bias of generative models. Adv. Neural Inform. Process. Syst 34, 18126–18136.
  • Series (2011) Series, B., 2011. Studio encoding parameters of digital television for standard 4: 3 and wide-screen 16: 9 aspect ratios. International Telecommunication Union, Radiocommunication Sector .
  • Shen et al. (2020) Shen, Y., Gu, J., Tang, X., Zhou, B., 2020. Interpreting the latent space of gans for semantic face editing, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 9243–9252.
  • Simonyan and Zisserman (2015) Simonyan, K., Zisserman, A., 2015. Very deep convolutional networks for large-scale image recognition, in: Bengio, Y., LeCun, Y. (Eds.), Int. Conf. Learn. Represent.
  • Song et al. (2020) Song, J., Meng, C., Ermon, S., 2020. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 .
  • Strang and Nguyen (1996) Strang, G., Nguyen, T., 1996. Wavelets and filter banks. SIAM.
  • Szegedy et al. (2016) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z., 2016. Rethinking the inception architecture for computer vision, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 2818–2826.
  • Tan and Le (2019) Tan, M., Le, Q., 2019. Efficientnet: Rethinking model scaling for convolutional neural networks, in: Int. Conf. Mach. Learn., pp. 6105–6114.
  • Tan and Le (2021) Tan, M., Le, Q., 2021. Efficientnetv2: Smaller models and faster training, in: Int. Conf. Mach. Learn., pp. 10096–10106.
  • Wang et al. (2020) Wang, H., Wu, X., Huang, Z., Xing, E.P., 2020. High-frequency component helps explain the generalization of convolutional neural networks, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 8684–8694.
  • Wang et al. (2023a) Wang, W., Han, C., Zhou, T., Liu, D., 2023a. Visual recognition with deep nearest centroids.
  • Wang et al. (2023b) Wang, Z., Zheng, H., He, P., Chen, W., Zhou, M., 2023b. Diffusion-gan: Training gans with diffusion.
  • Yin et al. (2019) Yin, D., Gontijo Lopes, R., Shlens, J., Cubuk, E.D., Gilmer, J., 2019. A fourier perspective on model robustness in computer vision. Adv. Neural Inform. Process. Syst 32.
  • Yu et al. (2015) Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., Xiao, J., 2015. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365 .
  • Zhang et al. (2022) Zhang, B., Gu, S., Zhang, B., Bao, J., Chen, D., Wen, F., Wang, Y., Guo, B., 2022. StyleSwin: Transformer-based gan for high-resolution image generation, in: IEEE Conf. Comput. Vis. Pattern Recog., pp. 11304–11314.
  • Zhang et al. (2021) Zhang, Z., Lu, X., Cao, G., Yang, Y., Jiao, L., Liu, F., 2021. Vit-yolo: Transformer-based yolo for object detection, in: Int. Conf. Comput. Vis., pp. 2799–2808.
  • Zhou et al. (2021) Zhou, R., Jiang, C., Xu, Q., 2021. A survey on generative adversarial network-based text-to-image synthesis. Neurocomputing 451, 316–336.
  • Zhuang et al. (2023) Zhuang, P., Abnar, S., Gu, J., Schwing, A., Susskind, J.M., Bautista, M.A., 2023. Diffusion probabilistic fields, in: Int. Conf. Learn. Represent.