
Conditional Idempotent Generative Networks

Niccolò Ronchetti
(June 3, 2024)
Abstract

We propose Conditional Idempotent Generative Networks (CIGN), a novel approach that expands upon Idempotent Generative Networks (IGN) to enable conditional generation. While IGNs offer efficient single-pass generation, they lack the ability to control the content of the generated data. CIGNs address this limitation by incorporating conditioning mechanisms, allowing users to steer the generation process towards specific types of data.

We establish the theoretical foundations for CIGNs, outlining their scope, loss function design, and evaluation metrics. We then present two potential architectures for implementing CIGNs: channel conditioning and filter conditioning. Finally, we discuss experimental results on the MNIST dataset, demonstrating the effectiveness of both approaches. Our findings pave the way for further exploration of CIGNs on larger datasets and with more powerful computing resources to determine the optimal implementation strategy.

1 Introduction

In the paper [10], the authors propose a novel approach to generation which they call Idempotent Generative Network (IGN). One major advantage of this approach compared to diffusion (or autoregressive) models is that generation is done in a single forward pass, and is therefore several times faster.

However, the paper [10] does not address how to condition that generation: a well-trained idempotent generative network will generate synthetic data resembling examples seen during training, but the authors do not describe how to nudge the IGN towards generating certain examples (for example, examples belonging to a certain class) over others.

Our paper starts answering that question: we describe how to expand the theory of idempotent generative networks to include a conditioning mechanism, and we call these networks Conditional Idempotent Generative Networks. We further describe two possible architectures for implementing conditional idempotent generative networks, and we discuss experimental results.

Our main contributions are then:

  1. We set the theoretical foundations for Conditional Idempotent Generative Networks, describing their scope and intended use, the format of the loss function and what metrics can be used to evaluate them.

  2. We describe two possible implementations of the conditioning mechanism of a Conditional Idempotent Generative Network - we call these mechanisms channel conditioning and filter conditioning.

  3. We discuss results of our experiments comparing the two implementations of CIGN on the MNIST dataset.

Our experiments show that both channel conditioning and filter conditioning are effective approaches to implementing a Conditional Idempotent Generative Network. Further experiments on larger datasets and with more powerful computing resources are needed to identify which mechanism is more effective.

2 Idempotent generative networks

We start by summarizing the ideas leading to the concept of idempotent generative networks, after [10].

The authors start from the observation that typically the true data distribution lives in a low-dimensional submanifold $D$ of the feature space $X$. Generating a synthetic data point then corresponds to choosing a point on this low-dimensional submanifold $D$. One possible way to do so is to have a function $F$ whose image is $D$, and then apply $F$ to a random input.

The authors’ idea is to construct $F$ to be an idempotent function on the feature space, with image constrained to be as close as possible to $D$. (A function $f: X \longrightarrow X$ is called idempotent if, when applied multiple times, it produces the same output as applying it once: $f(f(x)) = f(x)$.)

A machine learning network (called idempotent generative network) is then trained to learn the parameters of the idempotent function $F: X \longrightarrow X$ with image $D$.

The difficulty of creating synthetic data points (in other words, the difficulty of describing an accurate parametrization of the submanifold $D$ starting from a training dataset) is then subsumed into the difficulty of identifying optimal parameters for the function $F$. A well-trained idempotent generative network $F$ will then generate data points $F(x)$ when applied to any randomly selected point $x$ in the feature space $X$. In particular, generation happens in a single forward pass and is therefore considerably faster than autoregressive or diffusion generative methods.

We summarize the benefits of using idempotent generative networks:

  • Single-step generation: unlike diffusion or autoregressive models, IGNs generate outputs in one step.

  • Optional refinements: similar to diffusion models, IGNs allow for optional sequential refinements (meaning: we can pass the output through the network a second time, and since the trained $\hat{F}$ is not truly idempotent, we may obtain a slightly higher quality output - this was discussed in [10]).

  • No feature/latent space conundrum: unlike diffusion models, IGNs work only on the feature space, facilitating manipulations and interpolations of inputs and outputs.

  • Output enhancement: an IGN could be used to project degraded data back onto the original data distribution.

2.1 Loss function

While the initial version of the paper [10] does not provide details on the model architectures used in the authors' experiments, it does discuss at length the loss function of an IGN from a theoretical perspective.

Let again $X$ be the feature space and $D$ be the data manifold. Let $F: X \longrightarrow X$ be the idempotent function we are trying to construct and whose parameters are being learned by the IGN.

The loss function is then composed of the following three terms:

  • (reconstruction) Real data samples $d$ from the data distribution $D$ should remain unchanged by the model: $F(d) = d$.

  • (idempotent) Any further application of the model beyond the first one should act like the identity function: $F(F(z)) = F(z)$ for every $z$ in $X$.

  • (tightness) The previous two conditions should be satisfied while ensuring $F$ has as small a range as possible.

We encourage the interested reader to check out additional details in the original paper [10].

2.2 Network architecture

The theoretical framework of idempotent generative networks is fundamentally architecture-agnostic. And yet, one needs to choose a model architecture in order to test the feasibility of the idea in practice.

In [10], the authors do not explicitly describe the architecture used in their experiments. Inspired by [8], we will leverage a GAN-like architecture.

In the classic setup of Generative Adversarial Networks, a generator and a discriminator compete in a zero-sum game: the generator creates synthetic data points resembling as much as possible a real dataset, while the discriminator attempts to distinguish the real data points from the synthetic ones. In particular, in the GAN framework the discriminator is fed synthetic data points created by the generator.

In the IGN setting, instead, the two models will work in the opposite order: the discriminator $d$ processes input data into a latent tensor, and the generator $g$ creates a synthetic data point from that latent tensor. Our idempotent generative network is then $F = g \circ d$.
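As a minimal PyTorch sketch (the module names and calling convention below are illustrative assumptions, not the architecture used in [10]), this composition could look as follows.

```python
import torch
import torch.nn as nn

class IGN(nn.Module):
    """Idempotent generative network F = g ∘ d, obtained by composing a
    discriminator (feature space -> latent tensor) with a generator
    (latent tensor -> feature space)."""

    def __init__(self, discriminator: nn.Module, generator: nn.Module):
        super().__init__()
        self.discriminator = discriminator  # d : X -> latent
        self.generator = generator          # g : latent -> X

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # a single forward pass through both halves
        return self.generator(self.discriminator(x))
```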

3 Conditioning idempotent generative networks

Idempotent generative networks are then both simple and elegant from a theoretical perspective, and powerful from a practical one: in particular, IGNs can generate data in a single step, without the multi-step denoising process implemented by diffusion models or the single-token generation approach of autoregressive text generation models.

However, the approach taken in [10] leaves one crucial question unanswered: can we condition the model to generate new samples with prescribed characteristics?

We present multiple approaches that answer the previous question affirmatively.

3.1 Theoretical perspective

Let $X$ again be the feature space and $D$ be the data manifold, immersed into $X$. Suppose that every data point in $D$ comes with a condition. This could for example be a label or a caption (in case $D$ represents images). Let $C$ be the 'condition' space, the space to which conditions $c$ belong.

We can now extend the paradigm behind idempotent generative networks to consider conditioning: we have an augmented data manifold $\widetilde{D}$ immersed in $X \times C$, every point $(d,c)$ consisting of a data point $d$ and its condition $c$. We want to find an idempotent map

$F: X \times C \longrightarrow X \times C$

such that:

  • (reconstruction) $F(d,c) = (d,c)$ for every augmented data point $(d,c) \in \widetilde{D}$.

  • (idempotence) $F(F(x,c)) = F(x,c)$ for every $(x,c)$ in $X \times C$.

  • (tightness) the range of $F$ is as small as possible, i.e. $F(X \times C)$ is a low-dimensional submanifold of $X \times C$.

As explained in [10], the tightness condition is crucial to avoid the model learning to replicate the identity function (which satisfies the idempotence and the reconstruction requirements).

We plan on using the model as follows: suppose we want to generate a sample with condition $c$. We randomly select a noisy sample $z \in X$, we calculate $F(z,c) \in X \times C$ and we extract the first component $\tilde{d}$ of $F(z,c)$.
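Assuming a trained CIGN that, as described above, passes the condition through unchanged and returns a (data, condition) pair, generation is a single forward pass. The calling convention, function name and shapes below are illustrative assumptions.

```python
import torch

@torch.no_grad()
def generate(cign, condition: int, feature_shape=(1, 28, 28), device="cpu"):
    """Sample a noisy point z in X, apply F(z, c) once, and return the data component."""
    z = torch.randn(1, *feature_shape, device=device)  # noisy sample z in X
    c = torch.tensor([condition], device=device)       # condition c in C
    d_tilde, _ = cign(z, c)                            # F(z, c) = (d_tilde, c)
    return d_tilde
```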

3.2 Loss function

Even a perfect model $F$ satisfying the three conditions in section 3.1 above is not guaranteed to return a data point $\tilde{d}$ related to condition $c$. Indeed, the idempotent function $F$ is not being explicitly constrained to return a data point $\tilde{d}$ related to condition $c$, just a data point in the extended data manifold.

In order to help the model learn the conditioning, we treat mismatched data points like noisy examples. A mismatched data point is a pair $(d,\hat{c})$ where $(d,c)$ belongs to the augmented data manifold $\widetilde{D}$ and $c \neq \hat{c}$. In other words, we tell the model that $(d,\hat{c})$ should not be part of the range. By treating mismatched data points like noisy samples, we teach the network to not act like the identity function on them.

Let $D_c = \left\{ d \in D \,|\, (d,c) \in \widetilde{D} \right\}$ be the subset of the data manifold $D$ consisting of points with condition $c$. We are then teaching the model $F$ to have range contained in $\widetilde{D}$ (tightness and reconstruction conditions), but that $D_c \times \{\hat{c}\}$ should not be part of the range for any $\hat{c} \neq c$ (mismatched conditions). By exclusion this forces the model $F$ to further restrict its range to the union of $D_c \times \{c\}$ as $c$ varies in $C$.

The loss function for conditional idempotent generative networks is then made of five terms:

  1. (reconstruction loss $L_{rec}$) Real data samples $(d,c)$ from the augmented data manifold $\widetilde{D}$ should remain unchanged by the model: $F(d,c) = (d,c)$.

  2. (idempotence on noisy samples $L_{idem}^{noise}$) For any noisy sample $z$ and any condition $c$, any further application of the model beyond the first one should act like the identity function: $F(F(z,c)) = F(z,c)$.

  3. (tightness on noisy samples $L_{tight}^{noise}$) The idempotence condition on noisy samples and the reconstruction condition should be satisfied while ensuring $F$ has as small a range as possible.

  4. (idempotence on mismatched samples $L_{idem}^{mism}$) For a data point $(d,c)$ on the augmented data manifold $\widetilde{D}$ and a condition $\hat{c}$ different from $c$, any further application of the model beyond the first one should act like the identity function: $F(F(d,\hat{c})) = F(d,\hat{c})$.

  5. (tightness on mismatched samples $L_{tight}^{mism}$) The idempotence condition on mismatched samples and the reconstruction condition should be satisfied while ensuring $F$ has as small a range as possible.

Each one of these conditions is implemented via the choice of a distance function $\delta$ on the space $X \times C$, which is used to compare a data point before and after the application of $F$. (For simplicity one usually picks the same distance function $\delta$ for all loss terms, although this is not theoretically necessary.) For example, the reconstruction loss is implemented as

$L_{rec}(d,c) = \delta\left((d,c), F(d,c)\right) \quad \forall (d,c) \in \widetilde{D}.$

The total loss is then obtained as a weighted sum of these five terms:

$L = w_{rec} \cdot L_{rec} + w_{idem}^{noise} \cdot L_{idem}^{noise} + w_{tight}^{noise} \cdot L_{tight}^{noise} + w_{idem}^{mism} \cdot L_{idem}^{mism} + w_{tight}^{mism} \cdot L_{tight}^{mism}.$

Each one of the five weights is a hyperparameter that can be tuned during training.

During training, every batch will then be composed of three types of data points:

  • true data points $(d,c)$ in the augmented data manifold $\widetilde{D}$.

  • noisy samples $(z,c)$ where $z$ is randomly sampled from $X$ following a prior distribution.

  • mismatched samples $(d,\hat{c})$ for $(d,c) \in \widetilde{D}$ and $\hat{c} \neq c$.

The relative proportion of the three types of data points within a batch is a training hyperparameter.
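As an illustration, the sketch below assembles such a batch and combines the five loss terms. The model interface $F(x,c) \mapsto (x',c)$, the helper names, and the dictionary of weights are assumptions consistent with the description above, not a reference implementation.

```python
import torch

def build_batch(real_x, real_c, num_classes=10):
    """Assemble the three kinds of data points (real : noise : mismatched = 1 : 1 : 9
    when there are ten classes, as in the MNIST experiments below)."""
    # noisy samples: random points of the feature space, paired with random conditions
    noise_x = torch.randn_like(real_x).clamp(-1.0, 1.0)
    noise_c = torch.randint_like(real_c, num_classes)
    # mismatched samples: each real data point paired with every label except its own
    mism_x = real_x.repeat_interleave(num_classes - 1, dim=0)
    labels = torch.arange(num_classes, device=real_c.device).repeat(real_c.shape[0], 1)
    mism_c = labels[labels != real_c.unsqueeze(1)]
    return (real_x, real_c), (noise_x, noise_c), (mism_x, mism_c)

def reconstruction_loss(F, delta, d, c):
    """L_rec: real augmented data points should be fixed by F."""
    return delta(F(d, c), (d, c))

def total_loss(terms, weights):
    """Weighted sum of the five loss terms, matching the formula above; `terms` and
    `weights` are dicts keyed by 'rec', 'idem_noise', 'tight_noise', 'idem_mism', 'tight_mism'."""
    return sum(weights[k] * terms[k] for k in weights)
```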

3.3 Metrics

We discuss some theoretical considerations regarding what metrics should be used to evaluate Conditional Idempotent Generative Networks.

At a high level, an appropriate metric should measure the following two dimensions:

  • (realism) The synthetic data point $F(z,c)$ generated from condition $c$ should indeed look like a sample from $D_c$.

  • (diversity) Synthetic data points $F(z,c)$ generated from the same condition $c$ as we vary the noisy input $z$ should span as large a subset of $X$ as possible.

There is a fairly standard set of metrics typically used to measure these two dimensions for Generative Adversarial Networks, for example Inception Score (IS) and Fréchet Inception Distance (FID). These metrics have been adapted to the setup of conditional generative adversarial networks (see for example [1]).

More complex metrics (see for example [4]) have been suggested for conditional image generation models, especially if the generation is implemented via a diffusion process. In this latter scenario, human evaluation is most of the time still the most accurate performance metric.

All of these metrics are reasonable choices and the one to be preferred mostly depends on the specific use case. If the condition space $C$ is a finite set of classes, metrics based on the Inception Score and the Fréchet Inception Distance are probably adequate. If the condition space $C$ is a continuous manifold (for example, $c$ can be any English sentence), then human evaluation is likely to be preferred.

4 Two proposed implementations of CIGN

In this section, we describe two possible implementations of Conditional Idempotent Generative Networks.

For both implementations, we assume that the data has a channel dimension (for example audio, images or video data).

Let $(x,c)$ be an element in $X \times C$. We denote by $(n,S)$ the shape of the tensor $x$, where $n$ is the channel dimension and $S = (s_1, \ldots, s_k)$ represents any number of additional dimensions (for example, if $x$ is an image then $S = (h,w)$ is a pair of height and width dimensions).

4.1 Channel conditioning

Let $E: C \longrightarrow \mathbb{R}^{S}$ be an embedding function, where we denote by $\mathbb{R}^{S} = \mathbb{R}^{s_1} \times \ldots \times \mathbb{R}^{s_k}$ the subspace corresponding to the non-channel dimensions of the feature space $X$.

Channel conditioning is implemented by concatenating $E(c)$ and $x$ along the channel dimension.
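A minimal sketch of this operation for image-like inputs of shape (batch, channels, height, width) follows; the use of an nn.Embedding for $E$ and the specific sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ChannelConditioning(nn.Module):
    """Concatenate an embedding E(c), reshaped to the non-channel dimensions of x,
    as one extra channel of x."""

    def __init__(self, num_conditions: int, height: int, width: int):
        super().__init__()
        self.height, self.width = height, width
        self.embed = nn.Embedding(num_conditions, height * width)  # E : C -> R^S

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        e = self.embed(c).view(-1, 1, self.height, self.width)
        return torch.cat([x, e], dim=1)  # concatenation along the channel dimension
```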

This is a very generic and flexible idea, with many different variations available:

  • We can build different embeddings $E_i(c)$ whose target space consists of the non-channel dimensions of the input $x_i$ to layer $i$ of the network, and then concatenate $E_i(c)$ to $x_i$.

  • We can build embeddings whose target space is multiple copies of $\mathbb{R}^{S}$, so that when concatenating along the channel dimension we give the model additional flexibility on learning from the condition.

In section 5.3 we describe our experiments on training a conditional idempotent generative network with channel conditioning on the MNIST dataset.

4.2 Filter conditioning

Let $S' = (s_1', \ldots, s_k')$ be any $k$-tuple with $0 < s_i'$ for any $1 \leq i \leq k$. Let $E: C \longrightarrow \mathbb{R}^{S'}$ be an embedding function.

Filter conditioning is implemented by calculating the cross-correlation (or transposed cross-correlation) between $x$ and $E(c)$ along the non-channel dimensions. This is inspired by the concept of correlation filters for object detection (see for example [2]).

One can think of filter conditioning as replacing a standard or transposed convolutional layer (where the model learns the kernel’s weights and those weights are shared across all inputs) by a simpler cross-correlation layer, where each input is (transpose) cross-correlated to its condition’s embedding, and the model learns the embedding layer’s weights.
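For image-like inputs, the core operation can be written with the usual grouped-convolution trick, so that every sample is cross-correlated with its own condition-dependent kernels. The sketch below assumes the embedding $E(c)$ has already been reshaped into one kernel per channel.

```python
import torch
import torch.nn.functional as F

def conditioned_cross_correlation(x: torch.Tensor, kernels: torch.Tensor,
                                  stride: int = 1, padding: int = 0) -> torch.Tensor:
    """Channel-wise cross-correlation of x, shaped (B, C, H, W), with per-sample,
    per-channel kernels shaped (B, C, kh, kw) derived from the condition embedding."""
    b, c, h, w = x.shape
    kh, kw = kernels.shape[-2:]
    out = F.conv2d(x.reshape(1, b * c, h, w),          # fold the batch into the channels
                   kernels.reshape(b * c, 1, kh, kw),  # one kernel per (sample, channel) pair
                   stride=stride, padding=padding, groups=b * c)
    return out.view(b, c, out.shape[-2], out.shape[-1])
```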

This also is a very generic and flexible idea, with many different variations available:

  • We can choose different filter dimensions $S'$ for the embedding's target space - in fact, $S'$ can be considered as a hyperparameter of the model.

  • We can replace any number of convolutional layers (of a base model architecture that we start with) with filter conditioning layers where the filter dimensions $S'$ are the dimensions of the original convolutional layer.

  • We can concatenate the result of a standard convolutional layer with the result of a filter conditioning layer, assuming that the two layers' parameters (kernel shape, padding, stride, dilation) are identical.

5 Experiments on the MNIST dataset

In this section we discuss our experiments implementing CIGNs on the MNIST dataset. All experiments were run on consumer hardware - specifically one single free GPU offered by Sagemaker Studio Lab, with a maximum session time of 4 hours.

While this severely limited our experiments (in terms of model size and in terms of the number of hyperparameter runs we experimented with), we are still able to successfully train well-performing Conditional Idempotent Generative Networks (CIGNs), able to generate distinct, good-quality images from the same noisy data point when prompted with different conditions.

5.1 Preprocessing steps and high-level architecture

The MNIST dataset from [12] consists of images with a single channel and a size of $28 \times 28$ pixels, valued in $[0,1]$. We preprocess the dataset by transforming images into torch tensors, and then linearly rescaling pixels to be valued between $-1$ and $1$. We add some small random noise to the training data, to fight overfitting.

In the setup of the previous sections, we have then the feature space

$X = \mathbb{R}^{1} \times \mathbb{R}^{28} \times \mathbb{R}^{28},$

with the condition space

$C = \{0, \ldots, 9\}$

being the label set of the MNIST dataset.

Noisy data points are sampled from $X$ according to a normal distribution centered at $0$ and with standard deviation $1$, and then clipped so that they belong to $\left[-1,1\right]^{1 \times 28 \times 28}$.

The distance function δ\delta used to implement the loss function is

$\delta\big((x_1,c_1),(x_2,c_2)\big) = L^1\big(x_1 - x_2\big) = \left|x_1 - x_2\right|_1,$

the $L^1$ norm of the difference between the feature space components. While this is not strictly speaking a distance on $X \times C$, it works for our purposes since the models we implement pass through the condition $c$ 'as-is' (meaning that $F(x,c) = \left(F_1(x,c), c\right)$), and for the purpose of loss calculation we only need to calculate the distance between $(x,c)$ and $F(x,c)$.
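Concretely, the noise prior and the distance $\delta$ amount to a few lines; the sketch below matches the shapes above, with the function names chosen for illustration.

```python
import torch

def sample_noise(batch_size: int, device="cpu") -> torch.Tensor:
    """Standard normal noise over X = R^(1x28x28), clipped to [-1, 1]."""
    return torch.randn(batch_size, 1, 28, 28, device=device).clamp(-1.0, 1.0)

def delta(p1, p2):
    """delta((x1, c1), (x2, c2)) = |x1 - x2|_1: the conditions are passed through
    unchanged, so only the feature-space components are compared."""
    (x1, _), (x2, _) = p1, p2
    return (x1 - x2).abs().sum()
```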

At a high level, the architecture of our Conditional Idempotent Generative Networks is composed of:

  • a discriminator with domain $X \times C$, outputting a pair of a latent tensor and the condition $c$;

  • a generator, with input the latent tensor and the condition $c$, and range $X \times C$.

The CIGN is then obtained by running the discriminator first and the generator second.

The architectures of the discriminator and the generator are heavily inspired by the DCGAN framework, in its pytorch implementation ([6]).

The discriminator is defined as a sequence of convolutional layers with:

  • an increasing number of channels, starting from a latent dimension $l_{dim}$ and doubling at each layer, with the final latent tensor having size $(i_{dim}, 1, 1)$. (We denote by $i_{dim}$ the intermediate dimension; we call it intermediate because it is obtained exactly halfway through the forward pass of the entire CIGN, when the latent tensor is output by the discriminator and passed to the generator.)

  • decreasing non-channel dimensions, until the latent tensor has non-channel dimensions $(1,1)$ - that is to say, it is a 1-dimensional tensor.

The generator is defined by mirroring exactly the layers of the discriminator, but in the opposite order: a sequence of transpose convolutional layers with decreasing number of channels and increasing non-channel dimensions. The parameters of each convolutional layer of the discriminator (kernel size, stride, padding, dilation) are equal to those of the corresponding transpose convolutional layer of the generator.
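For example, with the parameters shared by several of the layers below (kernel 4, stride 2, padding 1), a convolutional layer of the discriminator and the transpose convolutional layer that mirrors it in the generator invert each other's spatial downsampling; the channel counts in this sketch are illustrative.

```python
import torch
import torch.nn as nn

# a discriminator layer and the generator layer that mirrors it share (kernel, stride, padding)
down = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=4, stride=2, padding=1)
up = nn.ConvTranspose2d(in_channels=64, out_channels=32, kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 32, 14, 14)
assert down(x).shape == (1, 64, 7, 7)   # 14x14 -> 7x7
assert up(down(x)).shape == x.shape     # the mirrored transpose layer restores 14x14
```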

5.2 Metrics

Our main metric is the average Fréchet Inception Distance across all 10 classes. We remark that this is identical to the metric named WCFID in section 3.2 of [1].

For each class, we compare the 1000 samples available in the validation dataset from [12] to 1000 synthetic samples generated by the model under consideration.

This comparison is implemented via the pytorch-fid library (see [7]): this library leverages the Inception-v3 model to get intermediate embeddings of the images, which are used to estimate the parameters of the data distribution under a normality assumption.

The aforementioned implementation [7] allows the user to choose among four different embeddings (each one passing the image through a partial Inception-v3 model [11] up to a certain layer), but one of them is not available to us due to the small size of the distributions we are comparing (just 1000 samples). The other three embeddings output respectively a 64-dimensional tensor, a 192-dimensional tensor and a 768-dimensional tensor. We denote the corresponding metrics by $\mathrm{FID}_{64}$, $\mathrm{FID}_{192}$ and $\mathrm{FID}_{768}$.
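For reference, the Fréchet distance underlying all three metrics is straightforward to compute once the two sets of Inception embeddings are available; the sketch below shows the formula itself (in our experiments we simply called the pytorch-fid package).

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two sets of embeddings
    (rows are samples, columns are embedding dimensions)."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts due to numerical noise
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```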

5.3 Channel conditioning

We implement the channel conditioning approach from section 4.1 in its simplest form:

  1. we train one shared embedding layer

     $E: C \longrightarrow \mathbb{R}^{\mathrm{emb}_{dim}},$

  2. and then for each (possibly transpose) convolutional layer where the input $x_i$ has non-channel dimensions $(h_i, w_i)$, we train one linear layer

     $L_i: \mathbb{R}^{\mathrm{emb}_{dim}} \longrightarrow \mathbb{R}^{h_i \cdot w_i},$

     and we concatenate $L_i(E(c))$ to $x_i$ along the channel dimension, before the forward pass of the (possibly transpose) convolutional layer.
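The sketch below shows one such conditioned convolutional block; the layer parameters are illustrative, the shared embedding is passed in so that it can be reused across all blocks, and the hyperbolic tangent applied to the embedding follows the architecture described in appendix A.1.

```python
import torch
import torch.nn as nn

class ChannelCondConvBlock(nn.Module):
    """One channel-conditioned convolutional block: a per-layer linear map L_i
    projects the shared embedding E(c) onto the non-channel dimensions (h_i, w_i),
    and the result is concatenated as one extra channel before the convolution."""

    def __init__(self, shared_embedding: nn.Embedding, emb_dim: int,
                 in_channels: int, out_channels: int, h: int, w: int,
                 kernel_size=4, stride=2, padding=1):
        super().__init__()
        self.h, self.w = h, w
        self.embed = shared_embedding             # E : C -> R^(emb_dim), shared
        self.project = nn.Linear(emb_dim, h * w)  # L_i : R^(emb_dim) -> R^(h_i * w_i)
        self.conv = nn.Conv2d(in_channels + 1, out_channels, kernel_size, stride, padding)

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        e = torch.tanh(self.project(self.embed(c))).view(-1, 1, self.h, self.w)
        return self.conv(torch.cat([x, e], dim=1))
```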

5.3.1 Training details

We run two channel conditioning experiments where the models have different hyperparameters, as in table 1.

Table 1: Parameters for channel conditioning experiments
Model size Embedding dim Latent dim Intermediate dim Num parameters
small 5 32 128 1.55m
large 5 64 128 5.6m

For each real data point in the augmented data manifold, we create nine mismatched data points by pairing the image with all other labels except the correct one. The number of noisy images in each batch is equal to the number of real data points. Our proportion of real:noise:mismatched data points is then 1:1:9.

We train the model for 50 epochs, with a learning rate of $0.001$ and a batch size of $128$ (this means that each backward pass is actually calculated on $128 \times 11 = 1408$ data points, due to the 1:1:9 ratio mentioned in the previous paragraph).

We experiment with a few different choices of weights for the five summands of the loss function described in section 3.2, and ultimately we choose the following values:

  • $w_{rec} = 20.0$

  • $w_{idem}^{noise} = 20.0$

  • $w_{tight}^{noise} = 2.5$

  • $w_{idem}^{mism} = 3.0$

  • $w_{tight}^{mism} = 1.0$

The detailed architectures of the generator and the discriminator are described in appendix A.1, including figures 1 and 2.

5.4 Filter conditioning

We implement the filter conditioning approach from section 4.2 in its simplest form:

  1. we train one shared embedding layer

     $E: C \longrightarrow \mathbb{R}^{\mathrm{emb}_{dim}},$

  2. for each (possibly transpose) convolutional layer where the standard DCGAN architecture in [6] has input $x_i$ with $c_i$ channels and a kernel of size $(h_i, w_i)$, we train one linear layer

     $L_i: \mathbb{R}^{\mathrm{emb}_{dim}} \longrightarrow \mathbb{R}^{c_i \cdot h_i \cdot w_i},$

  3. we calculate channel-wise (possibly transpose) cross-correlation between $x_i$ and $L_i(E(c))$,

  4. for each (possibly transpose) convolutional layer where the standard DCGAN architecture has $c_{in}$ input channels and $c_{out}$ output channels, we train a channel mixer linear layer

     $C_i: \mathbb{R}^{c_{in}} \longrightarrow \mathbb{R}^{c_{out}},$

     which we apply to the output of the cross-correlation from the previous step.
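A sketch of one such filter-conditioned layer follows; it reuses the grouped-convolution trick from section 4.2 for the channel-wise cross-correlation and applies the channel mixer $C_i$ pointwise across the channel dimension. The hyperbolic tangent on the kernels follows appendix A.2; everything else is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FilterCondConvBlock(nn.Module):
    """Filter-conditioned layer: cross-correlate each channel of x with a kernel
    derived from the condition (via L_i(E(c))), then mix channels with C_i."""

    def __init__(self, shared_embedding: nn.Embedding, emb_dim: int,
                 in_channels: int, out_channels: int, kernel_size=4,
                 stride=2, padding=1):
        super().__init__()
        self.cin, self.k = in_channels, kernel_size
        self.stride, self.padding = stride, padding
        self.embed = shared_embedding                                                  # shared E
        self.to_kernels = nn.Linear(emb_dim, in_channels * kernel_size * kernel_size)  # L_i
        self.mixer = nn.Linear(in_channels, out_channels)                              # C_i

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        b, cin, h, w = x.shape
        kernels = torch.tanh(self.to_kernels(self.embed(c))).view(b * cin, 1, self.k, self.k)
        out = F.conv2d(x.reshape(1, b * cin, h, w), kernels,
                       stride=self.stride, padding=self.padding, groups=b * cin)
        out = out.view(b, cin, out.shape[-2], out.shape[-1])
        # channel mixer C_i applied pointwise: (B, C_in, H', W') -> (B, C_out, H', W')
        return self.mixer(out.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
```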

5.4.1 Training details

We run two filter conditioning experiments where the models have different hyperparameters, as in table 2.

Table 2: Parameters for filter conditioning experiments
Model size Embedding dim Latent dim Intermediate dim Num parameters
small 5 64 128 616k
large 5 96 256 1.38m

For each real data point in the augmented data manifold, we create nine mismatched data points by pairing the image with all other labels except the correct one. The number of noisy images in each batch is equal to the number of real data points. Our proportion of real:noise:mismatched data points is then 1:1:9.

We train the model for 50 epochs, with a learning rate of $0.0001$ and a batch size of $64$ (this means that each backward pass is actually calculated on $64 \times 11 = 704$ data points, due to the 1:1:9 ratio mentioned in the previous paragraph).

We experiment with a few different choices of weights for the five summands of the loss function described in section 3.2, and ultimately we choose the following values:

  • $w_{rec} = 20.0$

  • $w_{idem}^{noise} = 20.0$

  • $w_{tight}^{noise} = 2.5$

  • $w_{idem}^{mism} = 8.0$

  • $w_{tight}^{mism} = 1.0$

The detailed architectures of the generator and the discriminator are described in appendix A.2, including figures 3 and 4.

5.5 Results

In appendices B.1 and B.2 we display samples of the images generated by the best experiments.

In this section we detail the metrics calculated on our trained experiments.

As discussed in section 5.2, our metric is the average Fréchet Inception Distance (FID) across the ten classes. For each trained experiment and each class, we compare the $1000$ samples in the MNIST validation dataset with $1000$ samples generated by the trained experiment. We also report in table 3 the minimum and maximum FID across all ten classes.

Table 3: FID metrics for all experiments
filter_large filter_small channel_large channel_small
$\mathrm{FID}_{64}$ mean 0.8278 0.3601 0.1149 0.6586
min 0.0837 0.0959 0.0455 0.1126
max 3.25 0.7516 0.272 2.1216
$\mathrm{FID}_{192}$ mean 3.3095 1.792 0.6598 2.7955
min 0.4912 0.5992 0.3735 0.5908
max 12.5165 5.1655 1.1886 8.2649
$\mathrm{FID}_{768}$ mean 0.4423 0.3384 0.2852 0.4619
min 0.2485 0.1869 0.1576 0.3557
max 0.8765 0.4641 0.5184 0.7127

Essentially all metrics point to the large experiment trained with channel conditioning as the best experiment of the four. On the other hand, that experiment has about 4x as many trainable parameters as the next two experiments (the small experiment trained with channel-conditioning and the large experiment trained with filter conditioning).

The metrics for the small experiment with filter conditioning (the smallest model of all, with only about 616k trainable parameters) are clearly better than those of the small experiment with channel conditioning.

The metrics for the large experiment with filter conditioning are noticeably worse. Human review of the synthetic images generated by this model suggests that it tends to create 'thick font' images (see appendix B.2), which is likely the reason for the large FID values and therefore the poor results. We have not investigated why this experiment tends to generate 'thick font' images.

In conclusion, we believe that the jury is still out regarding which of the two conditioning mechanisms is better - the best performing experiment is the large model with channel conditioning, but the small filter conditioning model clearly outperforms a slightly larger model trained with channel conditioning.

References

  • [1] Benny, Y., Galanti, T., Benaim, S., & Wolf, L. (2021). Evaluation metrics for conditional image generation. International Journal of Computer Vision, 129, 1712-1731.
  • [2] Bertinetto, L., Valmadre, J., Henriques, J. F., Vedaldi, A., & Torr, P. H. (2016). Fully-convolutional siamese networks for object tracking. In Computer Vision-ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part II 14 (pp. 850-865). Springer International Publishing.
  • [3] Engel, J., Agrawal, K. K., Chen, S., Gulrajani, I., Donahue, C., Roberts, A. (2019). Gansynth: Adversarial neural audio synthesis. arXiv preprint arXiv:1902.08710.
  • [4] Ku, M., Li, T., Zhang, K., Lu, Y., Fu, X., Zhuang, W., & Chen, W. (2023). Imagenhub: Standardizing the evaluation of conditional image generation models. arXiv preprint arXiv:2310.01596.
  • [5] Netron web application at https://netron.app/
  • [6] Official Pytorch implementation of [9]. https://github.com/pytorch/examples/tree/main/dcgan
  • [7] Pytorch implementation of Fréchet inception distance, available at https://github.com/mseitzer/pytorch-fid
  • [8] Pulfer, B. Implementation of IGN from the original paper [10], available at https://github.com/brianpulfer/idempotent-generative-network
  • [9] Radford, A., Metz, L., Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
  • [10] Shocher, A., Dravid, A., Gandelsman, Y., Mosseri, I., Rubinstein, M., & Efros, A. A. (2023). Idempotent generative network. arXiv preprint arXiv:2311.01462.
  • [11] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9).
  • [12] MNIST dataset from torchvision, available at https://pytorch.org/vision/stable/generated/torchvision.datasets.MNIST.html

Appendix A Model architectures

In this appendix, we describe in detail the model architectures of our experiments from section 5. As before, $l_{dim}$ is the latent dimension and $i_{dim}$ is the intermediate dimension.

We also report figures displaying the architectures of our experiments (all these figures were obtained with Netron [5]).

A.1 Channel conditioning architectures

The discriminator is defined as five consecutive convolutional layers. Each convolutional layer $i$ consists of the following five steps:

  1. batch normalization (except for the initial layer)

  2. concatenation of the input tensor and the embedding of the condition $L_i(E(c))$ (passed through a hyperbolic tangent activation) along the channel dimension

  3. convolutional layer

  4. dropout layer with rate $0.15$

  5. activation function (LeakyReLU with slope $0.2$ except for the last layer, which uses the sigmoid function)

Table 4 describes the specific parameters of each convolutional layer.

Table 4: Convolutional layer parameters for discriminator with channel conditioning
Layer Input size Output size Kernel size Stride Padding
1 (1, 28, 28) ($l_{dim}$, 14, 14) (4, 4) 2 1
2 ($l_{dim}$, 14, 14) ($l_{dim}*2$, 7, 7) (4, 4) 2 1
3 ($l_{dim}*2$, 7, 7) ($l_{dim}*4$, 4, 4) (3, 3) 2 1
4 ($l_{dim}*4$, 4, 4) ($l_{dim}*8$, 2, 2) (4, 4) 2 1
5 ($l_{dim}*8$, 2, 2) ($i_{dim}$, 1, 1) (2, 2) 1 0

The generator is defined as five consecutive transpose convolutional layers. Each transpose convolutional layer consists of the following five steps:

  1. batch normalization

  2. concatenation of the input tensor and the embedding of the condition, along the channel dimension

  3. transpose convolutional layer

  4. dropout layer with rate $0.15$ (except the last layer)

  5. activation function (ReLU except for the last layer, which uses the hyperbolic tangent function)

Table 5 describes the specific parameters of each transpose convolutional layer.

Table 5: Convolutional layer parameters for generator with channel conditioning
Layer Input size Output size Kernel size Stride Padding
1 ($i_{dim}$, 1, 1) ($l_{dim}*8$, 2, 2) (2, 2) 1 0
2 ($l_{dim}*8$, 2, 2) ($l_{dim}*4$, 4, 4) (4, 4) 2 1
3 ($l_{dim}*4$, 4, 4) ($l_{dim}*2$, 7, 7) (3, 3) 2 1
4 ($l_{dim}*2$, 7, 7) ($l_{dim}$, 14, 14) (4, 4) 2 1
5 ($l_{dim}$, 14, 14) (1, 28, 28) (4, 4) 2 1
Figure 1: Architecture of the generator with channel conditioning
Figure 2: Architecture of the discriminator with channel conditioning

A.2 Filter conditioning architectures

The discriminator is defined as five consecutive ‘convolutional’ layers. Each convolutional layer consists of the following five steps:

  1. batch normalization (except for the initial layer)

  2. channel-wise cross-correlation between the input tensor $x_i$ and the embedding of the condition $L_i(E(c))$ (passed through a hyperbolic tangent activation)

  3. channel mixer layer $C_i$, applied on the channel dimension

  4. dropout layer with rate $0.15$

  5. activation function (LeakyReLU with slope $0.2$ except for the last layer, which uses the sigmoid function)

Table 6 describes the specific parameters of each convolutional layer.

Table 6: Convolutional layer parameters for discriminator with filter conditioning
Layer Input size Kernel $L_i(E(c))$ size Stride Padding
1 (1, 28, 28) (1, 4, 4) 2 1
2 ($l_{dim}$, 14, 14) ($l_{dim}$, 4, 4) 2 1
3 ($l_{dim}*2$, 7, 7) ($l_{dim}*2$, 3, 3) 2 1
4 ($l_{dim}*4$, 4, 4) ($l_{dim}*4$, 4, 4) 2 1
5 ($l_{dim}*8$, 2, 2) ($l_{dim}*8$, 2, 2) 1 0

The generator is defined as five consecutive ‘transpose convolutional’ layers. Each transpose convolutional layer consists of the following five steps:

  1. batch normalization

  2. channel-wise transpose cross-correlation between the input tensor $x_i$ and the embedding of the condition $L_i(E(c))$ (passed through a hyperbolic tangent activation)

  3. channel mixer layer $C_i$, applied on the channel dimension

  4. dropout layer with rate $0.15$ (except the last layer)

  5. activation function (ReLU except for the last layer, which uses the hyperbolic tangent function)

Table 7 describes the specific parameters of each transpose convolutional layer.

Table 7: Convolutional layer parameters for generator with filter conditioning
Layer Input size Kernel $L_i(E(c))$ size Stride Padding
1 ($i_{dim}$, 1, 1) ($i_{dim}$, 2, 2) 1 0
2 ($l_{dim}*8$, 2, 2) ($l_{dim}*8$, 4, 4) 2 1
3 ($l_{dim}*4$, 4, 4) ($l_{dim}*4$, 3, 3) 2 1
4 ($l_{dim}*2$, 7, 7) ($l_{dim}*2$, 4, 4) 2 1
5 ($l_{dim}$, 14, 14) ($l_{dim}$, 4, 4) 2 1
Figure 3: Architecture of the generator with filter conditioning
Figure 4: Architecture of the discriminator with filter conditioning

Appendix B Generated images

In this section, we display synthetic images generated by our experiments on the MNIST dataset. Each group of images is generated by one experiment, starting from the same noisy sample but with different conditions.

B.1 Channel conditioning experiments

The following images were generated by the best experiments trained with the channel conditioning mechanism described in section 5.3.

Figure 5: Images generated by small model with channel conditioning
Figure 6: Images generated by large model with channel conditioning

B.2 Filter conditioning experiments

The following images were generated by the best experiments trained with the filter conditioning mechanism described in section 5.4.

Figure 7: Images generated by small model with filter conditioning
Figure 8: Images generated by large model with filter conditioning