
Imaginative Walks: Generative Random Walk Deviation Loss for Improved Unseen Learning Representation

Divyansh Jha*, Kai Yi*, Ivan Skorokhodov, Mohamed Elhoseiny (* denotes equal contributions)
Abstract

We propose a novel loss for generative models, dubbed GRaWD (Generative Random Walk Deviation), to improve learning representations of unexplored visual spaces. Learning a quality representation of unseen classes (or styles) is critical to facilitate novel image generation and better generative understanding of unseen visual classes, i.e., zero-shot learning (ZSL). By generating representations of unseen classes from their semantic descriptions, e.g., attributes or text, generative ZSL attempts to differentiate unseen from seen categories. The proposed GRaWD loss is defined by constructing a dynamic graph that includes the seen class/style centers and the generated samples in the current minibatch. Our loss initiates a random walk from each center through the visual generations produced from hallucinated unseen classes. As a deviation signal, we encourage the random walk to eventually land, after $t$ steps, in a feature representation that is difficult to classify as any of the seen classes. We demonstrate that the proposed loss can improve unseen class representation quality inductively on text-based ZSL benchmarks on the CUB and NABirds datasets and attribute-based ZSL benchmarks on the AWA2, SUN, and aPY datasets. In addition, we investigate the ability of the proposed loss to generate meaningful novel visual art on the WikiArt dataset. Experiments and human evaluations demonstrate that the proposed GRaWD loss can improve StyleGAN1 and StyleGAN2 generation quality and create novel art that is significantly more preferred. Our code is publicly available at https://github.com/Vision-CAIR/GRaWD.

Introduction

Figure 1: GRaWD loss encourages generatively visiting the orange realistic space, aiming to deviate from the seen classes and avoid the less real red space. Our loss starts from each seen class (in green) and performs a random walk through generated examples of hallucinated unseen classes (in orange) for $T$ steps. We then encourage the landing representation to be distinguishable from seen classes. With this property, the proposed loss helps improve generalized ZSL performance and novel art creation.

Generative models like GANs (Goodfellow et al. 2014) and VAEs (Kingma and Welling 2013) are excellent tools for generating realistic images due to their ability to represent high-dimensional probability distributions. However, they are not explicitly trained to go beyond the distribution seen during training. In recent years, generative models have been adopted to go beyond training data distributions and improve unseen class recognition, also known as zero-shot learning (Guo et al. 2017a; Long et al. 2017; Guo et al. 2017b; Kumar Verma et al. 2018; Zhu et al. 2018; Vyas, Venkateswara, and Panchanathan 2020). These methods train a conditional generative model $G(s_k, z)$ (Mirza and Osindero 2014; Odena, Olah, and Shlens 2017), where $s_k$ is the semantic description of class $k$ (attributes or text descriptions) and $z$ represents within-class variation (e.g., $z \sim \mathcal{N}(0, I)$). After training, $G(s_k, z)$ is used to generate imaginary data for unseen classes, transforming ZSL into a traditional classification task trained on the generated data. Understanding unseen classes is mainly leveraged by the generative model's improved ability to produce discriminative visual features/representations $G(s_u, z)$ from the corresponding unseen semantic descriptions $s_u$.

Figure 2: Art images on top with orange borders are generated using our loss. The bottom part shows the nearest neighbors (NN) in the training set (with green borders), which are different, indicating the novelty of our generations.

To generate likable novel visual content, GAN training has been augmented with a loss that encourages careful deviation from existing classes (Elgammal et al. 2017a; Sbai et al. 2018; Hertzmann 2018). Such models were shown to have some capability to produce unseen aesthetic art (Elgammal et al. 2017a), fashion (Sbai et al. 2018; Wu et al. 2021), and design (Nobari, Rashad, and Ahmed 2021). In a generalized ZSL context, CIZSL (Elhoseiny and Elfeki 2019) showed improved performance by modeling a similar deviation that explicitly encourages discrimination between seen and unseen classes. These losses improve unseen representation quality by encouraging the produced visual generations to be distinguishable from seen classes in ZSL (Elhoseiny and Elfeki 2019) and from seen styles in art (Elgammal et al. 2017a) and fashion (Sbai et al. 2018) generation.

We propose the Generative Random Walk Deviation (GRaWD) loss, a parameter-free graph-based loss that improves the learned representation of unseen classes; see Fig. 1. Our loss starts from each seen class (in green) and performs a random walk through the generated examples of hallucinated unseen classes (in orange) for $T$ steps. Then, we encourage the landing representation to be distant and distinguishable from the seen class centers. The GRaWD loss is computed over a similarity graph involving the seen class centers and the generated examples in the current minibatch of hallucinated unseen classes. Thus, GRaWD takes a global view of the data manifold, in contrast to existing deviation losses that are local/per example (e.g., Sbai et al. (2018); Elgammal et al. (2017a); Elhoseiny and Elfeki (2019)). In contrast to transductive methods (e.g., Vyas, Venkateswara, and Panchanathan (2020)), our loss is purely inductive and therefore does not require real descriptions of unseen classes during training. Our work is connected to recent advances in semi-supervised learning (e.g., Zhang et al. (2018); Ayyad et al. (2020); Ren et al. (2018); Haeusser, Mordvintsev, and Cremers (2017); Li et al. (2019)) that leverage unlabeled data within the training classes. In these methods, unlabeled data are encouraged to be attracted to existing classes. Our goal is the opposite: deviating from seen classes. Moreover, our loss operates on generated data of hallucinated unseen classes instead of provided unlabeled data.

Contribution. We propose a generative random walk loss that leverages generated data to explore the unseen embedding space discriminatively against the seen classes; see Fig. 1. Our loss is unsupervised on the generative space and can be applied to any GAN architecture (e.g., DCGAN (Radford, Metz, and Chintala 2016), StyleGAN (Karras, Laine, and Aila 2019a), and StyleGAN2 (Karras et al. 2020)). We show that our GRaWD loss helps models understand unseen visual classes better, improving generalized zero-shot learning performance on challenging benchmarks. We also show that, compared to existing deviation losses, GRaWD improves the generative capability to produce likable art in the unseen space; see Fig. 2.

Related Work

Figure 3: The Generative Random Walk Deviation loss starts from each seen class center (i.e., $\mathbf{c}_i$). It then performs a random walk through generated examples of hallucinated unseen classes produced by $G(s_u, z)$ for $T$ steps. The landing probability distribution of the random walk is encouraged to be uniform over the seen classes. For careful deviation from seen classes, the generated images are encouraged to be classified as real by the discriminator $D$; see Eq. 4.

Generative Models with Deviation Losses. In the context of computational creativity, several approaches have been proposed to produce original items with aesthetic and meaningful characteristics (Machado and Cardoso 2000; Mordvintsev, Olah, and Tyka 2015; DiPaola and Gabora 2009; Tendulkar et al. 2019). Various early studies have made progress on writing pop songs (Briot, Hadjeres, and Pachet 2017), transferring the styles of great painters to other images (Gatys, Ecker, and Bethge 2016; Date, Ganesan, and Oates 2017; Dumoulin et al. 2017; Johnson, Alahi, and Li 2016; Isola et al. 2017), and doodling sketches (Ha and Eck 2018). The creative space of style-transferred images is limited by the content image and the style image, which could be an artistic image by Van Gogh. GANs (Goodfellow et al. 2014; Radford, Metz, and Chintala 2016; Ha and Eck 2018; Reed et al. 2016; Zhang et al. 2017; Karras et al. 2018; Karras, Laine, and Aila 2019a) have the capability to learn visual distributions and produce images from a latent vector $z$. However, they are not trained explicitly to produce novel content beyond the training data. More recent work explored an early capability to produce novel art with CAN (Elgammal et al. 2017b) and fashion designs with Holistic-CAN (an improved version of CAN) (Sbai et al. 2018), both of which augment DCGAN (Radford, Metz, and Chintala 2016) with a loss that encourages deviation from existing styles. The difference between CAN and Holistic-CAN is that the deviation signal is a binary cross-entropy over individual styles in CAN (Elgammal et al. 2017b) and a multi-class cross-entropy (MCE) loss over all styles in Holistic-CAN (Sbai et al. 2018). Similar deviation losses were proposed in CIZSL (Elhoseiny and Elfeki 2019) for ZSL.

In contrast to these deviation losses, our loss is more global, as it establishes dynamic message passing between the generations produced at every minibatch iteration and the seen visual spaces. These generations should deviate from the seen class spaces represented by class centers. In our experiments, we apply our loss to unseen class recognition and to producing novel visual generations, showing superior performance compared to existing losses. We also note that random walks have been explored in the semi-supervised and few-shot learning literature for attracting unlabeled data points to their corresponding classes (e.g., Ayyad et al. (2020); Haeusser, Mordvintsev, and Cremers (2017)). In contrast, we develop a random-walk-based method to deviate from seen classes, which is the opposite objective, and which operates on generated data rather than on unlabeled data that are not available in purely inductive setups; see Fig. 1.

Zero-Shot Learning Methods. Classical ZSL methods directly predict attribute confidences from images to facilitate zero-shot recognition (e.g., the seminal works by Lampert, Nickisch, and Harmeling (2009a, 2013) and Farhadi et al. (2009)). Current ZSL methods fall into two branches. One branch casts the task as a visual-semantic embedding problem (Frome et al. 2013; Skorokhodov and Elhoseiny 2021; Liu et al. 2020). Akata et al. (2015, 2016) proposed Attribute Label Embedding (ALE), which models visual-semantic embedding as a bilinear compatibility function between the image space and the attribute space. In (Zhang, Xiang, and Gong 2016), deep ZSL methods were presented to model the non-linear mapping between vision and class descriptions. In the context of ZSL from noisy textual descriptions, an early linear approach for Wikipedia-based ZSL was proposed in (Elhoseiny, Saleh, and Elgammal 2013). Orthogonal to these improvements, generative models like GANs (Goodfellow et al. 2014) and VAEs (Kingma and Welling 2013) have been adopted to model multi-modality in zero-shot recognition by synthesizing visual features of unseen classes given their semantic descriptions, e.g., (Kumar Verma et al. 2018; Zhu et al. 2018; Schonfeld et al. 2019; Narayan et al. 2020; Han et al. 2021; Chen et al. 2021). Zhu et al. (2018) introduced a GAN model with a classification head alongside the standard real/fake head to improve text-based ZSL. Schonfeld et al. (2019) proposed a cross- and distribution-aligned VAE to better leverage the seen and unseen relationships. Han et al. (2021) utilized a generative network along with a multi-level supervised contrastive embedding strategy to learn image and semantic relationships. Our GRaWD loss helps improve the out-of-distribution performance of generative ZSL models.

Approach

We start this section with the formulation of our Generative Random Walk Deviation loss. We then show how it can be integrated with generative ZSL models to improve unseen class recognition and with state-of-the-art deep GAN models to encourage novel visual generations. We denote the generator as $G(s, z)$ and its parameters as $\theta_G$. As in (Xian et al. 2018b; Zhu et al. 2018; Elhoseiny and Elfeki 2019; Felix et al. 2018), the semantic representation $s$ is concatenated with a random vector $z \in \mathbb{R}^Z$ sampled from a Gaussian distribution $p_z = \mathcal{N}(0, 1)$ to generate an image for visual art generation, or visual features in the case of zero-shot learning. Hence, $G(s_k, z)$ is the generated image/feature from the semantic description $s_k$ of class $k$ and the noise vector $z$. We denote the discriminator as $D$ and its parameters as $\theta_D$. The discriminator is trained with two objectives: (1) predict real for images from the training set and fake for generated ones; (2) identify the category of the input image. The discriminator therefore has two heads: the first is a binary real/fake ($\{0, 1\}$) classifier, and the second is a $K^s$-way classifier over the seen classes. We denote the real/fake probability produced by $D$ for an input image as $D^r(\cdot)$, and the classification score of a seen class $k \in \mathcal{S}$ given the image as $D^{s,k}(\cdot)$.
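As a concrete illustration, the following is a minimal PyTorch sketch (not the authors' released code) of a discriminator with the two heads described above; the feature and hidden dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoHeadDiscriminator(nn.Module):
    """Sketch of D with a real/fake head D^r and a K^s-way seen-class head D^{s,k}."""
    def __init__(self, feat_dim=3584, hidden_dim=1024, num_seen_classes=150):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.LeakyReLU(0.2))
        self.real_fake_head = nn.Linear(hidden_dim, 1)              # D^r logit
        self.class_head = nn.Linear(hidden_dim, num_seen_classes)   # D^{s,k} logits

    def forward(self, x):
        h = self.backbone(x)   # last-layer activations, reused later by phi(.)
        return self.real_fake_head(h), self.class_head(h), h
```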

Generative Random Walk Deviation Loss

We sample $N_u$ examples that we aim to deviate from the seen classes/styles in the current minibatch using the generator $G(\cdot)$. We denote the features of these hallucinated generations as $X_u = \{x^u_1 \cdots x^u_{N_u}\}$. These features are extracted by $\phi(\cdot)$, a feature extraction function that we define as the activations of the last layer of the discriminator $D$ followed by scaled L2 normalization, $L2(\mathbf{v}, \beta) = \beta \frac{\mathbf{v}}{\|\mathbf{v}\|}$. The scaling factor mainly amplifies the norm of the vectors to avoid the vanishing gradient problem, inspired by (Bell et al. 2016). We use $\beta = 3$, guided by (Bell et al. 2016; Zhang et al. 2019). We denote the seen class centers that we aim to deviate from as $C = \{\mathbf{c}_1 \cdots \mathbf{c}_{K^s}\}$, defined in the same feature space as $X_u$, where $\mathbf{c}_i$ represents the center of seen class/style $i$. The formulation of $\mathbf{c}_i$ depends on the application (e.g., zero-shot learning or novel art generation) and is defined later in this section.
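A minimal sketch of the feature map $\phi(\cdot)$ described above (scaled L2 normalization of the discriminator's last-layer activations with $\beta = 3$); the small epsilon is an assumption added for numerical stability.

```python
import torch

def phi(disc_hidden, beta=3.0, eps=1e-8):
    """Scaled L2 normalization: beta * v / ||v||, applied row-wise to [N, d] features."""
    return beta * disc_hidden / (disc_hidden.norm(dim=1, keepdim=True) + eps)
```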

Let $B \in \mathbb{R}^{N_u \times K^s}$ be the similarity matrix between the features of the generations ($x^u \in X_u$) and the class centers ($\mathbf{c} \in C$). Similarly, let $A \in \mathbb{R}^{N_u \times N_u}$ be the similarity matrix between the generated points. In particular, we use the negative Euclidean distance between the embeddings as the similarity measure:

$B_{ij} = -\|x^u_i - \mathbf{c}_j\|^2, \quad A_{i,j} = -\|x^u_i - x^u_j\|^2$ (1)

where $x^u_i$ and $x^u_j$ are the $i^{th}$ and $j^{th}$ features in the set $X_u$; see Fig. 3. To avoid self-cycles, the diagonal entries $A_{i,i}$ are set to a small number $\epsilon$. We then define three transition probability matrices:

$P^{\mathbf{C} \rightarrow X_u} = \sigma(B^{\top}), \quad P^{X_u \rightarrow \mathbf{C}} = \sigma(B), \quad P^{X_u \rightarrow X_u} = \sigma(A)$ (2)

where $\sigma$ is the softmax operator applied over each row of the input matrix; $P^{\mathbf{C} \rightarrow X_u}$ and $P^{X_u \rightarrow \mathbf{C}}$ are the transition probability matrices from each seen class over the $N_u$ generated points and vice versa, respectively; and $P^{X_u \rightarrow X_u}$ is the transition probability matrix from each generated point over the other generated points. We hence define our generative random walk probability matrix as:

$P^{\mathbf{C} \rightarrow \mathbf{C}}(t, X_u) = \sigma(B^{\top}) \cdot (\sigma(A))^t \cdot \sigma(B)$ (3)

where $P^{\mathbf{C} \rightarrow \mathbf{C}}_{i,j}(t, X_u)$ denotes the probability of ending a random walk of length $t$ at seen class $j$ given that it started at seen class $i$; $t$ denotes the number of steps taken between the generated points before stepping back to land on a seen class/style.
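The following sketch (assumed tensor shapes, not the official implementation) builds the transition matrices of Eqs. 1-3 and the landing probabilities $P^{\mathbf{C} \rightarrow \mathbf{C}}(t, X_u)$; here the diagonal of $A$ is masked with a large negative similarity so that self-transitions get near-zero probability.

```python
import torch
import torch.nn.functional as F

def transition_matrices(Xu, C, diag_mask=-1e4):
    """Xu: [N_u, d] generated features, C: [K_s, d] seen class centers."""
    B = -torch.cdist(Xu, C) ** 2                                   # B_ij = -||x_i^u - c_j||^2
    A = -torch.cdist(Xu, Xu) ** 2                                  # A_ij = -||x_i^u - x_j^u||^2
    A = A + diag_mask * torch.eye(Xu.size(0), device=Xu.device)    # suppress self-cycles
    P_c2x = F.softmax(B.t(), dim=1)    # [K_s, N_u]  P^{C -> Xu}
    P_x2c = F.softmax(B, dim=1)        # [N_u, K_s]  P^{Xu -> C}
    P_x2x = F.softmax(A, dim=1)        # [N_u, N_u]  P^{Xu -> Xu}
    return P_c2x, P_x2c, P_x2x

def landing_probs(P_c2x, P_x2c, P_x2x, t):
    """P^{C->C}(t): start at a seen class, take t steps among generations, step back (Eq. 3)."""
    return P_c2x @ torch.linalg.matrix_power(P_x2x, t) @ P_x2c     # [K_s, K_s]
```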

Loss. Our random walk loss aims to boost the deviation of unseen visual spaces from the seen classes. Hence, we define our loss by encouraging each row in $P^{\mathbf{C} \rightarrow \mathbf{C}}(t)$ to be hard to classify to any seen class, as follows:

$L_{GRW}(X_u) = -\sum_{t=0}^{T} \gamma^t \sum_{i=1}^{K^s} \sum_{j=1}^{K^s} U_c(j) \log\big(P^{\mathbf{C} \rightarrow \mathbf{C}}_{i,j}(t, X_u)\big) - \sum_{j=1}^{N_u} U_x(j) \log\big(P_v(j)\big)$ (4)

where the first term minimizes the cross-entropy between every row of $P^{\mathbf{C} \rightarrow \mathbf{C}}(t, X_u)$, for $t = 0 \ldots T$, and the uniform distribution over seen classes $U_c(j) = \frac{1}{K^s}, \forall j = 1 \cdots K^s$; $T$ is a hyperparameter and $\gamma$ is an exponential decay, set to $0.7$ in our experiments. The second term maximizes the probability that all generations $x^u_i \in X_u$ are equally visited by the random walk. Note that if we replaced $U_c$ by an identity matrix to encourage landing back at the starting seen class, the loss would become an attraction signal similar to (Haeusser, Mordvintsev, and Cremers 2017), which defines its conceptual difference from GRaWD. We call this version GRaWT, with T for aTtraction. The second term, called the 'visit loss', was proposed in (Haeusser, Mordvintsev, and Cremers 2017) to encourage the random walker to visit a large set of unlabeled points. We compute the overall probability that each generated point is visited from any of the seen classes as $P_v = \frac{1}{K^s} \sum_{i=1}^{K^s} P^{\mathbf{C} \rightarrow X_u}_i$, where $P^{\mathbf{C} \rightarrow X_u}_i$ is the $i^{th}$ row of the $P^{\mathbf{C} \rightarrow X_u}$ matrix; see Fig. 3. The visit loss is then the cross-entropy between $P_v$ and the uniform distribution $U_x(j) = \frac{1}{N_u}, \forall j = 1 \cdots N_u$. Hence, the visit loss encourages visiting as many examples as possible from $X_u$, which improves the learned representation.
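Putting the pieces together, here is a minimal sketch of Eq. 4 reusing the helpers above: the deviation term is the cross-entropy of each row of $P^{\mathbf{C} \rightarrow \mathbf{C}}(t, X_u)$ against a uniform distribution over seen classes, and the visit term is the cross-entropy between $P_v$ and a uniform distribution over the generations. $T$ and $\gamma$ follow the text; everything else is an assumption.

```python
import torch

def grawd_loss(Xu, C, T=10, gamma=0.7, eps=1e-12):
    P_c2x, P_x2c, P_x2x = transition_matrices(Xu, C)
    K_s, N_u = P_c2x.shape
    loss, walk = 0.0, torch.eye(N_u, device=Xu.device)         # walk = (P^{Xu->Xu})^t, t = 0
    for t in range(T + 1):
        P_cc = P_c2x @ walk @ P_x2c                             # P^{C->C}(t), [K_s, K_s]
        loss = loss - (gamma ** t) * ((1.0 / K_s) * torch.log(P_cc + eps)).sum()
        walk = walk @ P_x2x
    P_v = P_c2x.mean(dim=0)                                     # visit distribution over generations
    loss = loss - ((1.0 / N_u) * torch.log(P_v + eps)).sum()    # visit loss
    return loss
```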

GRaWD Integration with Generative ZSL

Let us denote the sets of seen and unseen class labels as $\mathcal{S}$ and $\mathcal{U}$, where $\mathcal{S} \cap \mathcal{U} = \emptyset$. We denote the semantic representations of unseen and seen classes as $s_u = \psi(T_u) \in \mathcal{T}$ and $s_i = \psi(T_i) \in \mathcal{T}$, respectively, where $\mathcal{T}$ is the semantic space and $\psi(\cdot)$ is the semantic description function that extracts features from the text article or attribute description of a class. Let us denote the seen data as $D^s = \{(x^s_i, y^s_i, s_i)\}$, where $x^s_i \in \mathcal{X}$ denotes the visual features of the $i^{th}$ image and $y^s_i \in \mathcal{S}$ is the corresponding seen category label. For unseen classes, we are given only their semantic representations, one per class, $s_u$. We define $K^u$ as the number of unseen classes. In Generalized ZSL (GZSL), we aim to predict the label $y \in \mathcal{U} \cup \mathcal{S}$ at test time given an input $x$ that may belong to seen or unseen classes. We represent the seen classes as $C = \{\mathbf{c}_1 \cdots \mathbf{c}_{K^s}\}$, where $\mathbf{c}_i$ is the center of class $i$, defined as

$\mathbf{c}_i = \phi(G(z = \mathbf{0}, s_i))$ (5)

where $s_i$ is the attribute or text description of seen class $i$. $X_u = \{x^u_1 \cdots x^u_{N_u}\}$ are sampled as $\phi(G(z, s_u))$, where $z \sim p_z = \mathcal{N}(0, I)$ and $s_u \sim p^u$ is the semantic description of a hallucinated unseen class. We explicitly explore the unseen/creative space of the generator $G$ with hallucinated semantic representations $s_u \sim p^u$, where $p^u$ is a probability distribution over unseen descriptions aimed at being likely hard negatives to the seen classes. We sample $s_u \sim p^u$ following the strategy proposed in (Elhoseiny and Elfeki 2019) due to its simplicity and effectiveness: it picks two seen semantic descriptions at random, $s_a, s_b \in \mathcal{S}$, and samples $s_u = \alpha s_a + (1 - \alpha) s_b$, where $\alpha$ is uniformly sampled between $0.2$ and $0.8$. Values of $\alpha$ near $0$ or $1$ are discarded to avoid sampling semantic descriptions that are very close to seen classes. We then integrate $L_{GRW}(X_u)$ with the generator loss as follows.

$L_G = \lambda\, \mathbb{E}_{X_u \sim \phi(G(s_u, z)),\, z \sim p_z,\, s_u \sim p^u}\big[L_{GRW}(X_u)\big] - \mathbb{E}_{z \sim p_z,\, s_u \sim p^u}\big[D^r(G(s_u, z))\big] - \mathbb{E}_{z \sim p_z,\, (s_k, y^s) \sim p^s}\Big[D^r(G(s_k, z)) + \sum_{k=1}^{K^s} y^s_k \log\big(D^{s,k}(G(s_k, z))\big)\Big]$ (6)

Here, the first term is the proposed GRaWD loss. The second and third terms encourage the generator to fool the discriminator into classifying the visual generations from both the seen semantic descriptions $s_k$ and the hallucinated unseen semantic descriptions $s_u$ as real. The fourth term encourages the generator to discriminatively generate visual features conditioned on a given seen class description. We then define the discriminator loss as
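A minimal sketch of the ZSL-specific ingredients defined above: the seen class centers of Eq. 5 and the hallucinated unseen descriptions sampled by interpolating two random seen descriptions. `G` and `phi` are the conditional generator and feature map from earlier, assumed to take (descriptions, noise); shapes are assumptions.

```python
import torch

def seen_class_centers(G, phi, S_seen, z_dim):
    """c_i = phi(G(z=0, s_i)) for every seen class description s_i (Eq. 5)."""
    z = torch.zeros(S_seen.size(0), z_dim)
    return phi(G(S_seen, z))

def sample_hallucinated_descriptions(S_seen, n, alpha_range=(0.2, 0.8)):
    """Interpolate two randomly chosen seen descriptions with alpha ~ U(0.2, 0.8)."""
    idx_a = torch.randint(S_seen.size(0), (n,))
    idx_b = torch.randint(S_seen.size(0), (n,))   # may coincide with idx_a; not excluded here
    alpha = torch.empty(n, 1).uniform_(*alpha_range)
    return alpha * S_seen[idx_a] + (1 - alpha) * S_seen[idx_b]
```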

$L_D = \mathbb{E}_{z \sim p_z,\, s_u \sim p^u}\big[D^r(G(s_u, z))\big] + \mathbb{E}_{z \sim p_z,\, (s_k, y^s) \sim p^s}\big[D^r(G(s_k, z))\big] - \mathbb{E}_{x \sim p_d}\big[D^r(x)\big] + L_{Lip} - \frac{1}{2}\mathbb{E}_{x, y \sim p_d}\Big[\sum_{k=1}^{K^s} y_k \log\big(D^{s,k}(x)\big)\Big] - \frac{1}{2}\mathbb{E}_{z \sim p_z,\, (s_k, y^s) \sim p^s}\Big[\sum_{k=1}^{K^s} y^s_k \log\big(D^{s,k}(G(s_k, z))\big)\Big]$ (7)

Here, an image $x$ and its corresponding one-hot class label $y$ are sampled from the data distribution $p_d$; $s_k$ and $y^s$ are the features of a semantic description and the corresponding one-hot label sampled from the seen classes $p^s$. The first three terms approximate the Wasserstein distance between the distributions of real and fake features, and the fourth term is the gradient penalty enforcing the Lipschitz constraint: $L_{Lip} = (\|\nabla_{\tilde{x}} D^r(\tilde{x})\|_2 - 1)^2$, where $\tilde{x}$ is a linear interpolation of the real feature $x$ and the fake feature $\hat{x}$; see (Gulrajani et al. 2017). The last two terms are the classification losses of the real and generated data with respect to their corresponding classes.
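For completeness, a sketch of the gradient penalty term $L_{Lip}$ in Eq. 7, following the standard WGAN-GP formulation of Gulrajani et al. (2017); `d_real_head` is assumed to return the scalar score $D^r(\cdot)$.

```python
import torch

def gradient_penalty(d_real_head, x_real, x_fake):
    """L_Lip = (||grad_x~ D^r(x~)||_2 - 1)^2 on interpolates of real and fake features."""
    eps = torch.rand(x_real.size(0), 1, device=x_real.device)
    x_tilde = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    grads = torch.autograd.grad(d_real_head(x_tilde).sum(), x_tilde, create_graph=True)[0]
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()
```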

GRaWD Integration with StyleGANs for Novel Art Generation

We integrate our loss with DCGAN (Radford, Metz, and Chintala 2015), StyleGAN (Karras, Laine, and Aila 2019a), and StyleGAN2 (Karras et al. 2020) by simply adding $L_{GRW}$ from Eq. 4 to the generator loss. We assume $N_s$ seen art styles that we aim to deviate from. Here, we define $C = \{\mathbf{c}_1 \cdots \mathbf{c}_{K^s}\}$ by sampling a small episodic memory of size $m$ for every style and computing $\mathbf{c}_i$ from discriminator features. We randomly sample $m = 10$ examples per style once and compute their mean representation at each iteration. We provide more training details in the supplementary.
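A minimal sketch (assumed data structures, not the official code) of the style centers described above: a fixed episodic memory of m = 10 images per seen style whose discriminator features are averaged at every iteration.

```python
import torch

def style_centers(memory, phi_images):
    """memory: dict {style_id: [m, C, H, W] image tensor}; phi_images maps images to features."""
    centers = [phi_images(images).mean(dim=0) for _, images in sorted(memory.items())]
    return torch.stack(centers)   # [K_s, d] centers c_1 ... c_{K_s}
```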

Table 1: Ablation studies on CUB Dataset (text).
Setting | CUB-Easy: Top-1 Acc (%) / SU-AUC (%) | CUB-Hard: Top-1 Acc (%) / SU-AUC (%)
Deviation losses on GAZSL (Zhu et al. 2018):
   + GRaWT (T=0) | 44.0 / 39.5 | 13.7 / 11.8
   + GRaWT (T=3) | 43.4 / 38.8 | 13.2 / 11.4
   + Classify $G(s_u, z)$ as class $K^s + 1$ | 43.2 / 38.3 | 11.31 / 9.5
   + CIZSL (Elhoseiny and Elfeki 2019) | 44.6 / 39.2 | 14.4 / 11.9
Walk length on GAZSL (Zhu et al. 2018):
   + GRaWD (T=1) | 45.41 / 39.62 | 13.79 / 12.58
   + GRaWD (T=3) | 45.11 / 39.25 | 14.21 / 13.22
   + GRaWD (T=5) | 45.40 / 40.51 | 14.00 / 13.07
   + GRaWD (T=10) | 45.43 / 40.68 | 15.51 / 13.70
Table 2: Zero-Shot Recognition from textual descriptions on the CUB and NAB datasets (Easy and Hard splits), showing that adding the GRaWD loss improves performance. (tr) denotes the transductive setting.
Method | Top-1 Accuracy (%): CUB Easy / CUB Hard / NAB Easy / NAB Hard | Seen-Unseen AUC (%): CUB Easy / CUB Hard / NAB Easy / NAB Hard
ZSLNS (Qiao et al. 2016) | 29.1 / 7.3 / 24.5 / 6.8 | 14.7 / 4.4 / 9.3 / 2.3
SynC-fast (Changpinyo et al. 2016) | 28.0 / 8.6 / 18.4 / 3.8 | 13.1 / 4.0 / 2.7 / 3.5
ZSLPP (Elhoseiny et al. 2017) | 37.2 / 9.7 / 30.3 / 8.1 | 30.4 / 6.1 / 12.6 / 3.5
FeatGen (Xian et al. 2018b) | 43.9 / 9.8 / 36.2 / 8.7 | 34.1 / 7.4 / 21.3 / 5.6
LsrGAN (tr) (Vyas et al. 2020) | 45.2 / 14.2 / 36.4 / 9.0 | 39.5 / 12.1 / 23.2 / 6.4
   + GRaWD | 45.6 (+0.4) / 15.1 (+0.9) / 37.8 (+1.4) / 9.7 (+0.7) | 39.9 (+0.4) / 13.3 (+1.2) / 24.5 (+1.3) / 6.7 (+0.3)
GAZSL (Zhu et al. 2018) | 43.7 / 10.3 / 35.6 / 8.6 | 35.4 / 8.7 / 20.4 / 5.8
   + CIZSL (Elhoseiny and Elfeki 2019) | 44.6 / 14.4 / 36.6 / 9.3 | 39.2 / 11.9 / 24.5 / 6.4
   + GRaWD | 45.4 (+1.7) / 15.5 (+5.2) / 38.4 (+2.8) / 10.1 (+1.5) | 40.7 (+5.3) / 13.7 (+5.0) / 25.8 (+5.4) / 7.4 (+1.6)
Table 3: Attribute-based ZSL on AwA2, aPY, and SUN, comparing GRaWD with the attraction variant GRaWT (similar to Haeusser, Mordvintsev, and Cremers (2017)). H denotes the harmonic mean of seen (S) and unseen (U) accuracy.
Method | AwA2 (H / S / U) | aPY (H / S / U) | SUN (H / S / U)
GRaWT (T=0) | 32.3 / 80.5 / 20.2 | 23.0 / 78.9 / 13.4 | 26.0 / 31.6 / 22.2
GRaWT (T=3) | 31.6 / 80.7 / 19.7 | 22.4 / 75.8 / 13.1 | 25.8 / 31.1 / 22.1
GRaWD | 39.0 / 88.3 / 25.0 | 27.2 / 83.2 / 16.3 | 27.9 / 37.3 / 22.3
Table 4: Zero-Shot Recognition on class-level attributes of the AwA2, aPY, and SUN datasets, showing that the GRaWD loss improves performance on attribute-based benchmarks.
Method | Top-1 Accuracy (%): AwA2 / aPY / SUN | Seen-Unseen H: AwA2 / aPY / SUN
SJE (Akata et al. 2015) | 61.9 / 35.2 / 53.7 | 14.4 / 6.9 / 19.8
LATEM (Xian et al. 2016) | 55.8 / 35.2 / 55.3 | 20.0 / 0.2 / 19.5
ALE (Akata et al. 2016) | 62.5 / 39.7 / 58.1 | 23.9 / 8.7 / 26.3
SYNC (Changpinyo et al. 2016) | 46.6 / 23.9 / 56.3 | 18.0 / 13.3 / 13.4
SAE (Kodirov, Xiang, and Gong 2017) | 54.1 / 8.3 / 40.3 | 2.2 / 0.9 / 11.8
DEM (Zhang, Xiang, and Gong 2016) | 67.1 / 35.0 / 61.9 | 25.1 / 19.4 / 25.6
FeatGen (Xian et al. 2018b) | 54.3 / 42.6 / 60.8 | 17.6 / 21.4 / 24.9
cycle-(U)WGAN (Felix et al. 2018) | 56.2 / 44.6 / 60.3 | 19.2 / 23.6 / 24.4
LsrGAN (tr) (Vyas et al. 2020) | 60.1* / 34.6* / 62.5 | 48.7* / 31.5* / 44.8
   + GRaWD | 63.7 (+3.6) / 35.5 (+0.9) / 64.2 (+1.7) | 49.2 (+0.5) / 32.7 (+1.2) / 46.1 (+1.3)
GAZSL (Zhu et al. 2018) | 58.9 / 41.1 / 61.3 | 15.4 / 24.0 / 26.7
   + CIZSL (Elhoseiny and Elfeki 2019) | 67.8 / 42.1 / 63.7 | 24.6 / 25.7 / 27.8
   + GRaWD | 68.4 (+9.5) / 43.3 (+2.2) / 62.1 (+0.8) | 39.0 (+23.6) / 27.2 (+3.2) / 27.9 (+1.2)

Experiments

Purely Inductive Generative ZSL Experiments

Purely Inductive Evaluation in ZSL: Our focus in this paper is to learn a good representation of unseen visual spaces without accessing any unseen class information during training. However, most recent papers jointly train an extra classifier (e.g., an MLP) with their proposed generative model (Narayan et al. 2020; Han et al. 2021). More concretely, this classifier is trained on features $\tilde{X}_u = G(u_k, z)$ generated from the unseen semantic descriptions $u_k$; in GZSL, the seen training images $X_s$ are additionally used, together with $\tilde{X}_u$, as input to this extra classifier. We refer to methods that assume access to unseen class descriptions during training as semantic transductive ZSL (even if they do not use unlabeled images of unseen classes). Accessing unseen information before evaluation is not in line with our focus on learning generative unseen representations, and it is less realistic if we aim at purely inductive zero-shot learning. Following purely inductive ZSL settings (e.g., (Zhu et al. 2018; Elhoseiny and Elfeki 2019)), we use nearest-neighbor (NN) classification on the generated features for evaluation, which avoids accessing any unseen semantic information before testing.

We performed experiments on existing ZSL benchmarks with text descriptions and attributes as semantic class descriptions. Note that the text description setting is more challenging because the descriptions are at the class level and are extracted from Wikipedia, which is noisier. We found the number of random walk steps $T$ easy to tune using the validation set.

Text-Based ZSL. We performed our text-based ZSL experiments on Caltech-UCSD Birds-2011 (CUB) (Wah et al. 2011), containing 200 classes with 11,788 images, and North America Birds (NAB) (Van Horn et al. 2015), which has 1011 classes with 48,562 images. We use two metrics widely used in evaluating ZSL recognition performance: standard zero-shot recognition with Top-1 unseen class accuracy, and generalized zero-shot performance with the Area Under the Seen-Unseen Curve (Chao et al. 2016). We follow (Chao et al. 2016; Zhu et al. 2018; Elhoseiny and Elfeki 2019) in using the Area Under the SUC to evaluate the generalization capability of class-level text zero-shot recognition on four splits (CUB Easy, CUB Hard, NAB Easy, and NAB Hard). The hard splits are constructed such that unseen bird classes come from super-categories that do not overlap with seen classes. Our proposed loss improves over earlier methods on all datasets on both the Easy and SCE (Hard) splits, as shown in Table 2, with gains in the range of 0.8-1.8% Top-1 accuracy and 1-1.8% AUC. From Table 2, GAZSL (Zhu et al. 2018)+GRaWD has an average relative Seen-Unseen AUC improvement of 9.29% over GAZSL (Zhu et al. 2018)+CIZSL (Elhoseiny and Elfeki 2019) and 30.89% over GAZSL (Zhu et al. 2018) alone. We achieve state-of-the-art results on the text datasets. In Table 1, we performed an ablation study showing that longer random walks perform better, yielding higher accuracies and AUC scores for both the easy and hard splits of the CUB dataset. With longer walks, the model obtains a more holistic view of the generated visual representations in a way that enables better deviation of unseen classes from seen classes. Therefore, we used T=10 in our experiments.
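As a reference for the Seen-Unseen AUC metric used above (Chao et al. 2016), the following is a rough sketch: a calibration constant added to the unseen-class scores is swept to trade off seen and unseen accuracy, and the area under the resulting accuracy-accuracy curve is reported. The score matrices, label encoding, and sweep range are assumptions, not the paper's exact evaluation code.

```python
import numpy as np

def seen_unseen_auc(scores_seen, y_seen, scores_unseen, y_unseen, unseen_ids, n_points=200):
    """scores_*: [N, K_s + K_u] class scores; y_*: ground-truth column indices."""
    acc_s, acc_u = [], []
    for gamma in np.linspace(-5.0, 5.0, n_points):        # calibration added to unseen scores
        cal_s, cal_u = scores_seen.copy(), scores_unseen.copy()
        cal_s[:, unseen_ids] += gamma
        cal_u[:, unseen_ids] += gamma
        acc_s.append((cal_s.argmax(1) == y_seen).mean())
        acc_u.append((cal_u.argmax(1) == y_unseen).mean())
    order = np.argsort(acc_s)                             # trace the seen-vs-unseen accuracy curve
    return np.trapz(np.array(acc_u)[order], np.array(acc_s)[order])
```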

Attribute-Based ZSL. We performed experiments on the widely used GBU (Xian et al. 2018a) setup, where class attributes serve as semantic descriptors, on the AwA2 (Lampert, Nickisch, and Harmeling 2009b), aPY (Farhadi et al. 2009), and SUN (Patterson and Hays 2012) datasets. In Table 4, GRaWD outperforms all of the compared methods on the seen-unseen harmonic mean for AwA2, aPY, and SUN. On AwA2, it outperforms the compared methods by a significant margin, i.e., 15.1%. It is also competitive with existing methods in Top-1 accuracy, improving by 4.8% on AwA2. From Table 4, GAZSL (Zhu et al. 2018)+GRaWD has an average relative improvement in harmonic mean of 24.92% over GAZSL (Zhu et al. 2018)+CIZSL (Elhoseiny and Elfeki 2019) and 61.35% over GAZSL (Zhu et al. 2018).

Tables 1 and 3 show that the deviation signal in GRaWD is critical to achieving better performance, since the metrics are consistently better for GRaWD than for GRaWT in both text-based and attribute-based ZSL; performance can degrade severely without the deviation signal. Table 1 (bottom section) shows that longer walk lengths benefit training, as the model is encouraged to globally explore larger segments of the unseen representation manifold.

GRaWD Loss for Transductive ZSL. We also apply our GRaWD loss to the transductive ZSL setting and choose LsrGAN (Vyas, Venkateswara, and Panchanathan 2020) as the baseline model. Our loss improves LsrGAN on both text-based and attribute-based datasets on most metrics, with gains ranging from 0.3% to 3.6%. Although our loss does not use unseen class descriptors, it still improves over LsrGAN (transductive) by 1.96% on average on attribute datasets and 2.91% on text-based datasets. However, in line with our expectations, the improvement in the purely inductive setting is more significant.

Novel Art Generation Experiments

We performed our art experiments on the WikiArt dataset, which contains 81k images spanning 27 different styles (WikiArt 2015).

Table 5: Human experiments on generated art from vanilla GAN, CAN, and GRaWD losses. Models trained with GRaWD obtained the highest mean likeability in all groups, and more people believed the artwork generated by GRaWD-trained models to be made by a real artist.

Loss | Architecture | Likeability Mean: Q1-mean (std) | NN ↑ | NN ↓ | Entropy ↑ | Random | Turing Test: Q2 (% Artist)
CAN (Elgammal et al. 2017b) | DCGAN | 3.20 (1.50) | - | - | - | - | 53
GAN (Vanilla) | StyleGAN | 3.12 (0.58) | 3.07 | 3.36 | 3.00 | 3.06 | 55.33
CAN | StyleGAN | 3.20 (0.62) | 3.01 | 3.61 | 3.05 | 3.11 | 56.55
RW-T3 (Ours) | StyleGAN | 3.29 (0.59) | 3.05 | 3.58 | 3.13 | 3.38 | 54.08
RW-T10 (Ours) | StyleGAN | 3.29 (0.63) | 3.15 | 3.67 | 3.15 | 3.17 | 58.63
GAN (Vanilla) | StyleGAN2 | 3.02 (1.15) | 2.89 | 3.30 | 2.79 | 3.09 | 54.01
CAN | StyleGAN2 | 3.23 (1.16) | 3.27 | 3.34 | 3.11 | 3.21 | 57.9
RW-T3 (Ours) | StyleGAN2 | 3.40 (1.1) | 3.30 | 3.61 | 3.33 | 3.35 | 64.0

Table 6: Normalized mean ranking (lower is better) computed from the likeability experiment. We take the mean rating of each artwork for both the CAN and GRaWD losses, then stack, sort, and normalize them to compute the normalized mean rank. Each number is the normalized rank of the corresponding model listed above it.

StyleGAN1 comparisons: CAN/RW-T10 = 0.53/0.47, CAN/RW-T3 = 0.53/0.47, CAN/RW-T10/RW-T3 = 0.52/0.48/0.50
StyleGAN2 comparisons: CAN/RW-T3 = 0.54/0.46, GAN/RW-T3 = 0.59/0.41, CAN/GAN/RW-T3 = 0.49/0.59/0.42

Baselines. We performed comparisons with two baselines: (1) the vanilla GAN for the chosen architecture, and (2) the same architecture with the Holistic-CAN loss (Sbai et al. 2018), an improved version of CAN (Elgammal et al. 2017b). For simplicity, we refer to Holistic-CAN as CAN.

Nomenclature. Models are referred to as RW-T(value), where RW denotes the GRaWD loss and T is the number of random walk steps; we follow this convention throughout. We perform human subject experiments to evaluate the generated art. We set the loss coefficient $\lambda$ to 10. We divide the generations from these models into four groups, each containing 100 images (see examples in Fig. 4); a sketch of how such groups can be formed is given below.

  • NN↑: Images with high nearest-neighbor (NN) distance from the training dataset.

  • NN↓: Images with low NN distance from the training dataset.

  • Entropy↑: Images with high entropy of the probabilities from a style classifier.

  • Random (R): A set of random images.

For example, we denote generations using our GRaWD loss with $T = 10$ and the NN↑ group as RW-T10_NN↑. Fig. 4 shows the most liked and disliked paintings according to human evaluation.
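A minimal sketch of how the groups above can be constructed: rank the generations by nearest-neighbor distance to the training set and by the entropy of a style classifier's predictions, then take the top or bottom slices. The feature extractor, classifier, and cut-offs are assumptions rather than the paper's exact setup.

```python
import torch

def rank_generations(gen_feats, train_feats, style_logits):
    """Returns per-image NN distance and style-classifier entropy used to form the groups."""
    nn_dist = torch.cdist(gen_feats, train_feats).min(dim=1).values   # NN distance to training set
    probs = torch.softmax(style_logits, dim=1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=1)          # style entropy per image
    return nn_dist, entropy   # sort descending for NN-up / Entropy-up, ascending for NN-down
```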

Figure 4: Most liked and disliked art generated using StyleGAN trained on GRaWD for the different groups.
Figure 5: Percentage of each rating from human subject experiments on generated images. Compared to CAN, images from our loss are rated (5,4) by a significantly larger share of people, and are rated (1,2) by fewer people.

Human Evaluation. We performed our human subject MTurk experiments on the vanilla, CAN, and GRaWD variants of the StyleGAN1 (Karras, Laine, and Aila 2019b) and StyleGAN2 (Karras et al. 2020) architectures. We divide the generations into the four groups described above. We collect five responses for each art piece (400 images), totaling 2000 responses per model from 341 unique workers. In the first question (Q1), we asked people to rate generations from 1 (extremely dislike) to 5 (extremely like). In Q2, we asked whether the image was generated by a computer or an artist (a Turing test). More setup details are in the supplementary. We found that art from StyleGAN1 and StyleGAN2 trained with our loss was more likable, and more people believed it to be real art, as shown in Table 5. For StyleGAN1, adding the GRaWD loss resulted in 38% and 18% more people giving the full rating of 5 compared to vanilla StyleGAN1 and StyleGAN1 + CAN loss, respectively; see Fig. 5. For StyleGAN2, these improvements were 65% and 15%. Table 6 shows that images from the StyleGAN models trained with our loss achieve a better normalized rank when pooled with the other models in each comparison.

Figure 6: Empirical approximation of the Wundt curve (Packard 1975; Wundt 1874). The color of each data point represents a specific model, and its label specifies the group, named according to our nomenclature. Art from the NN↑ group has lower likeability than the NN↓ group. Examples of artworks with high and low likeability, together with their novelty, are shown.
Figure 7: Distribution of emotional responses for StyleGAN-generated art trained with the GRaWD loss. An example image for fear, awe, and contentment is shown. The box beneath each emotion shows the most frequent words used by evaluators to describe their feeling. These responses were collected from a survey on Mechanical Turk.

Wundt Curve Analysis. We approximate the Wundt curve (Packard 1975; Wundt 1874) by fitting a degree-3 polynomial to a scatter plot of normalized likeability vs. mean NN distance (a novelty measure). Generations are more likable if the deviation from existing art is moderate but not extreme; see Fig. 6. Compared to CAN and GAN, our loss achieves, on balance, novel images that are more preferred.
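A minimal sketch of the curve fit described above (a degree-3 polynomial over novelty/likeability pairs); variable names and data loading are assumptions.

```python
import numpy as np

def fit_wundt_curve(novelty, likeability, degree=3):
    """Fit likeability ~ f(novelty) with a cubic polynomial, as in the Wundt-curve analysis."""
    return np.poly1d(np.polyfit(novelty, likeability, deg=degree))
```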

Emotion Experiments. Evaluators selected the emotion they felt after looking at each artwork and justified their choice in text. We collected 5 responses each for a set of 600 generated artworks from 260 unique workers. Fig. 7 shows the distribution over the selected emotions, which are diverse but mostly positive. However, some generations evoke negative emotions like fear. Fig. 7 also shows the most frequent words for each emotion after removing stop words. Notable positive words include "funny", "beautiful", and "attractive", while negative words include "dark" and "ghostly", which are associated with feelings like fear and disgust.

Conclusion

We proposed the Generative Random Walk Deviation (GRaWD) loss and showed that it improves generative models' ability to understand unseen classes on several zero-shot learning benchmarks and to generate novel visual content when trained on the WikiArt dataset. We attribute the improvement to the global nature of our learning mechanism, which operates at the minibatch level, producing generations that pass messages to each other to facilitate better deviation of unseen classes/styles from seen ones.


References

  • Akata et al. (2016) Akata, Z.; Perronnin, F.; Harchaoui, Z.; and Schmid, C. 2016. Label-embedding for image classification. PAMI, 38(7): 1425–1438.
  • Akata et al. (2015) Akata, Z.; Reed, S.; Walter, D.; Lee, H.; and Schiele, B. 2015. Evaluation of output embeddings for fine-grained image classification. In CVPR.
  • Ayyad et al. (2020) Ayyad, A.; Navab, N.; Elhoseiny, M.; and Albarqouni, S. 2020. Semi-Supervised Few-Shot Learning with Prototypical Random Walks.
  • Bell et al. (2016) Bell, S.; Lawrence Zitnick, C.; Bala, K.; and Girshick, R. 2016. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2874–2883.
  • Briot, Hadjeres, and Pachet (2017) Briot, J.-P.; Hadjeres, G.; and Pachet, F. 2017. Deep Learning Techniques for Music Generation-A Survey. arXiv:1709.01620.
  • Changpinyo et al. (2016) Changpinyo, S.; Chao, W.-L.; Gong, B.; and Sha, F. 2016. Synthesized classifiers for zero-shot learning. In CVPR, 5327–5336.
  • Chao et al. (2016) Chao, W.-L.; Changpinyo, S.; Gong, B.; and Sha, F. 2016. An Empirical Study and Analysis of Generalized Zero-Shot Learning for Object Recognition in the Wild. In ECCV, 52–68. Springer.
  • Chen et al. (2021) Chen, S.; Wang, W.; Xia, B.; Peng, Q.; You, X.; Zheng, F.; and Shao, L. 2021. FREE: Feature Refinement for Generalized Zero-Shot Learning. arXiv preprint arXiv:2107.13807.
  • Date, Ganesan, and Oates (2017) Date, P.; Ganesan, A.; and Oates, T. 2017. Fashioning with Networks: Neural Style Transfer to Design Clothes. In KDD ML4Fashion workshop.
  • DiPaola and Gabora (2009) DiPaola, S.; and Gabora, L. 2009. Incorporating characteristics of human creativity into an evolutionary art algorithm. Genetic Programming and Evolvable Machines, 10(2): 97–110.
  • Dumoulin et al. (2017) Dumoulin, V.; Shlens, J.; Kudlur, M.; Behboodi, A.; Lemic, F.; Wolisz, A.; Molinaro, M.; Hirche, C.; Hayashi, M.; Bagan, E.; et al. 2017. A learned representation for artistic style. ICLR.
  • Elgammal et al. (2017a) Elgammal, A.; Liu, B.; Elhoseiny, M.; and Mazzone, M. 2017a. CAN: Creative adversarial networks, generating "art" by learning about styles and deviating from style norms. arXiv preprint arXiv:1706.07068.
  • Elgammal et al. (2017b) Elgammal, A.; Liu, B.; Elhoseiny, M.; and Mazzone, M. 2017b. CAN: Creative adversarial networks, generating "art" by learning about styles and deviating from style norms. In International Conference on Computational Creativity.
  • Elhoseiny and Elfeki (2019) Elhoseiny, M.; and Elfeki, M. 2019. Creativity Inspired Zero-Shot Learning. In Proceedings of the IEEE International Conference on Computer Vision, 5784–5793.
  • Elhoseiny, Saleh, and Elgammal (2013) Elhoseiny, M.; Saleh, B.; and Elgammal, A. 2013. Write a classifier: Zero-shot learning using purely textual descriptions. In ICCV.
  • Elhoseiny et al. (2017) Elhoseiny, M.; Zhu, Y.; Zhang, H.; and Elgammal, A. 2017. Link the Head to the ”Beak”: Zero Shot Learning From Noisy Text Description at Part Precision. In CVPR.
  • Farhadi et al. (2009) Farhadi, A.; Endres, I.; Hoiem, D.; and Forsyth, D. 2009. Describing objects by their attributes. In CVPR 2009., 1778–1785. IEEE.
  • Felix et al. (2018) Felix, R.; Kumar, V. B.; Reid, I.; and Carneiro, G. 2018. Multi-modal cycle-consistent generalized zero-shot learning. In ECCV, 21–37.
  • Frome et al. (2013) Frome, A.; Corrado, G. S.; Shlens, J.; Bengio, S.; Dean, J.; Mikolov, T.; et al. 2013. Devise: A deep visual-semantic embedding model. In NIPS, 2121–2129.
  • Gatys, Ecker, and Bethge (2016) Gatys, L. A.; Ecker, A. S.; and Bethge, M. 2016. Image Style Transfer Using Convolutional Neural Networks. In CVPR.
  • Goodfellow et al. (2014) Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS, 2672–2680.
  • Gulrajani et al. (2017) Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; and Courville, A. 2017. Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028.
  • Guo et al. (2017a) Guo, Y.; Ding, G.; Han, J.; and Gao, Y. 2017a. Synthesizing Samples for Zero-shot Learning. In IJCAI.
  • Guo et al. (2017b) Guo, Y.; Ding, G.; Han, J.; and Gao, Y. 2017b. Zero-shot Learning with Transferred Samples. IEEE Transactions on Image Processing.
  • Ha and Eck (2018) Ha, D.; and Eck, D. 2018. A Neural Representation of Sketch Drawings. ICLR.
  • Haeusser, Mordvintsev, and Cremers (2017) Haeusser, P.; Mordvintsev, A.; and Cremers, D. 2017. Learning by Association — A Versatile Semi-Supervised Training Method for Neural Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 626–635.
  • Han et al. (2021) Han, Z.; Fu, Z.; Chen, S.; and Yang, J. 2021. Contrastive Embedding for Generalized Zero-Shot Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2371–2381.
  • Hertzmann (2018) Hertzmann, A. 2018. Can computers create art? In Arts, volume 7, 18. Multidisciplinary Digital Publishing Institute.
  • Isola et al. (2017) Isola, P.; Zhu, J.; Zhou, T.; and Efros, A. A. 2017. Image-to-Image Translation with Conditional Adversarial Networks. CVPR.
  • Johnson, Alahi, and Li (2016) Johnson, J.; Alahi, A.; and Li, F. 2016. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. ECCV.
  • Karras et al. (2018) Karras, T.; Aila, T.; Laine, S.; and Lehtinen, J. 2018. Progressive Growing of GANs for Improved Quality, Stability, and Variation. ICLR.
  • Karras, Laine, and Aila (2019a) Karras, T.; Laine, S.; and Aila, T. 2019a. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4401–4410.
  • Karras, Laine, and Aila (2019b) Karras, T.; Laine, S.; and Aila, T. 2019b. A Style-Based Generator Architecture for Generative Adversarial Networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Karras et al. (2020) Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; and Aila, T. 2020. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8110–8119.
  • Kingma and Welling (2013) Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • Kodirov, Xiang, and Gong (2017) Kodirov, E.; Xiang, T.; and Gong, S. 2017. Semantic autoencoder for zero-shot learning. arXiv preprint arXiv:1704.08345.
  • Kumar Verma et al. (2018) Kumar Verma, V.; Arora, G.; Mishra, A.; and Rai, P. 2018. Generalized zero-shot learning via synthesized examples. In CVPR.
  • Lampert, Nickisch, and Harmeling (2009a) Lampert, C. H.; Nickisch, H.; and Harmeling, S. 2009a. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 951–958. IEEE.
  • Lampert, Nickisch, and Harmeling (2009b) Lampert, C. H.; Nickisch, H.; and Harmeling, S. 2009b. Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 951–958. IEEE.
  • Lampert, Nickisch, and Harmeling (2013) Lampert, C. H.; Nickisch, H.; and Harmeling, S. 2013. Attribute-based classification for zero-shot visual object categorization. IEEE transactions on pattern analysis and machine intelligence, 36(3): 453–465.
  • Li et al. (2019) Li, X.; Sun, Q.; Liu, Y.; Zhou, Q.; Zheng, S.; Chua, T.-S.; and Schiele, B. 2019. Learning to self-train for semi-supervised few-shot classification. In Advances in Neural Information Processing Systems, 10276–10286.
  • Liu et al. (2020) Liu, S.; Chen, J.; Pan, L.; Ngo, C.-W.; Chua, T.-S.; and Jiang, Y.-G. 2020. Hyperbolic visual embedding learning for zero-shot recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9273–9281.
  • Long et al. (2017) Long, Y.; Liu, L.; Shao, L.; Shen, F.; Ding, G.; and Han, J. 2017. From Zero-shot Learning to Conventional Supervised Classification: Unseen Visual Data Synthesis. In CVPR.
  • Machado and Cardoso (2000) Machado, P.; and Cardoso, A. 2000. NEvAr–the assessment of an evolutionary art tool. In Proc. of the AISB00 Symposium on Creative & Cultural Aspects and Applications of AI & Cognitive Science, volume 456.
  • Mirza and Osindero (2014) Mirza, M.; and Osindero, S. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
  • Mordvintsev, Olah, and Tyka (2015) Mordvintsev, A.; Olah, C.; and Tyka, M. 2015. Inceptionism: Going deeper into neural networks. Google Research Blog. Retrieved June.
  • Narayan et al. (2020) Narayan, S.; Gupta, A.; Khan, F. S.; Snoek, C. G.; and Shao, L. 2020. Latent embedding feedback and discriminative features for zero-shot classification. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16, 479–495. Springer.
  • Nobari, Rashad, and Ahmed (2021) Nobari, A. H.; Rashad, M. F.; and Ahmed, F. 2021. Creativegan: editing generative adversarial networks for creative design synthesis. arXiv preprint arXiv:2103.06242.
  • Odena, Olah, and Shlens (2017) Odena, A.; Olah, C.; and Shlens, J. 2017. Conditional image synthesis with auxiliary classifier gans. In ICML.
  • Packard (1975) Packard, S. 1975. Aesthetics and Psychobiology by DE Berlyne. Leonardo, 8(3): 258–259.
  • Patterson and Hays (2012) Patterson, G.; and Hays, J. 2012. Sun attribute database: Discovering, annotating, and recognizing scene attributes. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 2751–2758. IEEE.
  • Qiao et al. (2016) Qiao, R.; Liu, L.; Shen, C.; and v. d. Hengel, A. 2016. Less is More: Zero-Shot Learning from Online Textual Documents with Noise Suppression. In CVPR.
  • Radford, Metz, and Chintala (2015) Radford, A.; Metz, L.; and Chintala, S. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
  • Radford, Metz, and Chintala (2016) Radford, A.; Metz, L.; and Chintala, S. 2016. Unsupervised representation learning with deep convolutional generative adversarial networks. ICLR.
  • Reed et al. (2016) Reed, S. E.; Akata, Z.; Mohan, S.; Tenka, S.; Schiele, B.; and Lee, H. 2016. Learning What and Where to Draw. In NIPS.
  • Ren et al. (2018) Ren, M.; Ravi, S.; Triantafillou, E.; Snell, J.; Swersky, K.; Tenenbaum, J. B.; Larochelle, H.; and Zemel, R. S. 2018. Meta-Learning for Semi-Supervised Few-Shot Classification. In International Conference on Learning Representations.
  • Sbai et al. (2018) Sbai, O.; Elhoseiny, M.; Bordes, A.; LeCun, Y.; and Couprie, C. 2018. DeSIGN: Design Inspiration from Generative Networks. In ECCV workshop.
  • Schonfeld et al. (2019) Schonfeld, E.; Ebrahimi, S.; Sinha, S.; Darrell, T.; and Akata, Z. 2019. Generalized zero-and few-shot learning via aligned variational autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8247–8255.
  • Skorokhodov and Elhoseiny (2021) Skorokhodov, I.; and Elhoseiny, M. 2021. Class Normalization for (Continual)? Generalized Zero-Shot Learning. In International Conference on Learning Representations.
  • Tendulkar et al. (2019) Tendulkar, P.; Krishna, K.; Selvaraju, R. R.; and Parikh, D. 2019. Trick or TReAT: Thematic Reinforcement for Artistic Typography. In ICCC.
  • Van Horn et al. (2015) Van Horn, G.; Branson, S.; Farrell, R.; Haber, S.; Barry, J.; Ipeirotis, P.; Perona, P.; and Belongie, S. 2015. Building a Bird Recognition App and Large Scale Dataset With Citizen Scientists: The Fine Print in Fine-Grained Dataset Collection. In CVPR.
  • Vyas, Venkateswara, and Panchanathan (2020) Vyas, M.; Venkateswara, H.; and Panchanathan, S. 2020. Leveraging seen and unseen semantic relationships for generative zero-shot learning. In European Conference on Computer Vision, 70–86. Springer.
  • Wah et al. (2011) Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.
  • WikiArt (2015) WikiArt, O. 2015. WikiArt Dataset. https://www.wikiart.org/. Accessed: 2020-05-30.
  • Wu et al. (2021) Wu, Q.; Zhu, B.; Yong, B.; Wei, Y.; Jiang, X.; Zhou, R.; and Zhou, Q. 2021. ClothGAN: generation of fashionable Dunhuang clothes using generative adversarial networks. Connection Science, 33(2): 341–358.
  • Wundt (1874) Wundt, W. M. 1874. Grundzüge der physiologischen Psychologie, volume 1. W. Engelman.
  • Xian et al. (2016) Xian, Y.; Akata, Z.; Sharma, G.; Nguyen, Q.; Hein, M.; and Schiele, B. 2016. Latent embeddings for zero-shot classification. In CVPR, 69–77.
  • Xian et al. (2018a) Xian, Y.; Lampert, C. H.; Schiele, B.; and Akata, Z. 2018a. Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. PAMI.
  • Xian et al. (2018b) Xian, Y.; Lorenz, T.; Schiele, B.; and Akata, Z. 2018b. Feature generating networks for zero-shot learning. In CVPR.
  • Zhang et al. (2017) Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; and Metaxas, D. 2017. StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks. In ICCV.
  • Zhang et al. (2019) Zhang, J.; Kalantidis, Y.; Rohrbach, M.; Paluri, M.; Elgammal, A.; and Elhoseiny, M. 2019. Large-scale visual relationship understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 9185–9194.
  • Zhang, Xiang, and Gong (2016) Zhang, L.; Xiang, T.; and Gong, S. 2016. Learning a Deep Embedding Model for Zero-Shot Learning. In CVPR.
  • Zhang et al. (2018) Zhang, R.; Che, T.; Ghahramani, Z.; Bengio, Y.; and Song, Y. 2018. MetaGAN: An Adversarial Approach to Few-Shot Learning. In Advances in Neural Information Processing Systems, 2371–2380.
  • Zhu et al. (2018) Zhu, Y.; Elhoseiny, M.; Liu, B.; Peng, X.; and Elgammal, A. 2018. A generative adversarial approach for zero-shot learning from noisy texts. In CVPR.