Imaginative Walks: Generative Random Walk Deviation Loss for Improved Unseen Learning Representation
Abstract
We propose a novel loss for generative models, dubbed GRaWD (Generative Random Walk Deviation), to improve learning representations of unexplored visual spaces. Quality learning representation of unseen classes (or styles) is critical to facilitate novel image generation and better generative understanding of unseen visual classes, i.e., zero-shot learning (ZSL). By generating representations of unseen classes based on their semantic descriptions, e.g., attributes or text, generative ZSL attempts to differentiate unseen from seen categories. The proposed GRaWD loss is defined by constructing a dynamic graph that includes the seen class/style centers and generated samples in the current minibatch. Our loss initiates a random walk from each center through visual generations produced from hallucinated unseen classes. As a deviation signal, we encourage the random walk to eventually land, after a number of steps, in a feature representation that is difficult to classify as any of the seen classes. We demonstrate that the proposed loss can improve unseen class representation quality inductively on text-based ZSL benchmarks on CUB and NABirds datasets and attribute-based ZSL benchmarks on AWA2, SUN, and aPY datasets. In addition, we investigate the ability of the proposed loss to generate meaningful novel visual art on the WikiArt dataset. The results of experiments and human evaluations demonstrate that the proposed GRaWD loss can improve StyleGAN1 and StyleGAN2 generation quality and create novel art that is significantly more preferred. Our code is made publicly available at https://github.com/Vision-CAIR/GRaWD.
Introduction

Generative models like GANs (Goodfellow et al. 2014) and VAEs (Kingma and Welling 2013) are excellent tools for generating realistic images due to their ability to represent high-dimensional probability distributions. However, they are not explicitly trained to go beyond the distribution seen during training. In recent years, generative models have been adopted to go beyond training data distributions and improve unseen class recognition (also known as zero-shot learning) (Guo et al. 2017a; Long et al. 2017; Guo et al. 2017b; Kumar Verma et al. 2018; Zhu et al. 2018; Vyas, Venkateswara, and Panchanathan 2020). These methods train a conditional generative model $G(s_k, z)$ (Mirza and Osindero 2014; Odena, Olah, and Shlens 2017), where $s_k$ is the semantic description of class $k$ (attributes or text descriptions) and $z$ represents within-class variation (e.g., $z \sim \mathcal{N}(0, I)$). After training, $G$ is used to generate imaginary data for unseen classes, transforming ZSL into a traditional classification task trained on the generated data. Understanding unseen classes is mainly leveraged by the generative model's improved ability to produce discriminative visual features/representations of unseen classes from their corresponding semantic descriptions.

To generate likable novel visual content, GANs' training has been augmented with a loss that encourages careful deviation from existing classes (Elgammal et al. 2017a; Sbai et al. 2018; Hertzmann 2018). Such models were shown to have some capability to produce unseen aesthetic art (Elgammal et al. 2017a), fashion (Sbai et al. 2018; Wu et al. 2021), and design (Nobari, Rashad, and Ahmed 2021). In a generalized ZSL context, CIZSL (Elhoseiny and Elfeki 2019) showed improved performance by modeling a similar deviation to explicitly encourage discrimination between seen and unseen classes. These losses improve unseen representation quality by encouraging the produced visual generations to be distinguishable from seen classes in ZSL (Elhoseiny and Elfeki 2019) and seen styles in art (Elgammal et al. 2017a) and fashion (Sbai et al. 2018) generation.
We propose the Generative Random Walk Deviation loss (GRaWD) as a parameter-free, graph-based loss to improve learning representations of unseen classes; see Fig. 1. Our loss starts from each seen class (in green) and performs a random walk through the generated examples of hallucinated unseen classes (in orange) for $T$ steps. Then, we encourage the landing representation to be distant and distinguishable from the seen class centers. The GRaWD loss is computed over a similarity graph involving the seen class centers and the generated examples in the current minibatch of hallucinated unseen classes. Thus, GRaWD takes a global view of the data manifold compared to existing deviation losses that are local/per example (e.g., Sbai et al. (2018); Elgammal et al. (2017a); Elhoseiny and Elfeki (2019)). In contrast to transductive methods (e.g., Vyas, Venkateswara, and Panchanathan (2020)), our loss is purely inductive and therefore does not require real descriptions of unseen classes during training. Our work is connected to recent advances in semi-supervised learning (e.g., Zhang et al. (2018); Ayyad et al. (2020); Ren et al. (2018); Haeusser, Mordvintsev, and Cremers (2017); Li et al. (2019)) that leverage unlabeled data within the training classes. In these methods, unlabeled data are encouraged to be attracted to existing classes. Our goal is the opposite: deviating from seen classes. Also, our loss operates on generated data of hallucinated unseen classes instead of provided unlabeled data.
Contribution. We propose a generative random walk loss that leverages generated data by exploring the unseen embedding space discriminatively against the seen classes; see Fig. 1. Our loss is unsupervised on the generative space and can be applied to any GAN architecture (e.g., DCGAN (Radford, Metz, and Chintala 2016), StyleGAN (Karras, Laine, and Aila 2019a), and StyleGAN2 (Karras et al. 2020)). We show that our GRaWD loss helps understand unseen visual classes better, improving generalized zero-shot learning on challenging benchmarks. We also show that, compared to existing deviation losses, GRaWD improves the capability to generate likable art in the unseen space; see Fig. 2.
Related Work

Generative Models with Deviation Losses. In the context of computational creativity, several approaches have been proposed to produce original items with aesthetic and meaningful characteristics (Machado and Cardoso 2000; Mordvintsev, Olah, and Tyka 2015; DiPaola and Gabora 2009; Tendulkar et al. 2019). Various early studies have made progress on writing pop songs (Briot, Hadjeres, and Pachet 2017), transferring styles of great painters to other images (Gatys, Ecker, and Bethge 2016; Date, Ganesan, and Oates 2017; Dumoulin et al. 2017; Johnson, Alahi, and Li 2016; Isola et al. 2017), and doodling sketches (Ha and Eck 2018). The creative space of style-transfer images is limited by the content image and the style image, which could be an artistic image by Van Gogh. GANs (Goodfellow et al. 2014; Radford, Metz, and Chintala 2016; Ha and Eck 2018; Reed et al. 2016; Zhang et al. 2017; Karras et al. 2018; Karras, Laine, and Aila 2019a) have the capability to learn visual distributions and produce images from a latent vector. However, they are not trained explicitly to produce novel content beyond the training data. More recent work explored an early capability to produce novel art with CAN (Elgammal et al. 2017b) and fashion designs with Holistic-CAN (an improved version of CAN) (Sbai et al. 2018), which are based on augmenting DCGAN (Radford, Metz, and Chintala 2016) with a loss encouraging deviation from existing styles. The difference between CAN and Holistic-CAN is that the deviation signal is a binary cross-entropy loss over individual styles for CAN (Elgammal et al. 2017b) and a multi-class cross-entropy (MCE) loss over all styles in Holistic-CAN (Sbai et al. 2018). Similar deviation losses were proposed in CIZSL (Elhoseiny and Elfeki 2019) for ZSL.
In contrast to these deviation losses, our loss is more global, as it establishes dynamic message passing between the generations produced at every minibatch iteration and the seen visual spaces, which the generations should deviate from, as represented by class centers. In our experiments, we applied our loss to unseen class recognition and to producing novel visual generations, showing superior performance compared to existing losses. We also note that random walks have been explored in the literature in the context of semi-supervised and few-shot learning for attracting unlabeled data points to their corresponding classes (e.g., Ayyad et al. (2020); Haeusser, Mordvintsev, and Cremers (2017)). In contrast, we develop a random walk-based method to deviate from seen classes, which is an opposite objective, and it operates on generated data rather than unlabeled data, which are not available in purely inductive setups; see Fig. 1.
Zero-Shot Learning Methods. Classical ZSL methods directly predict attribute confidences from images to facilitate zero-shot recognition (e.g., seminal works by Lampert, Nickisch, and Harmeling (2009a, 2013) and Farhadi et al. (2009)). Current ZSL methods can be classified into two branches. One branch casts the task as a visual-semantic embedding problem (Frome et al. 2013; Skorokhodov and Elhoseiny 2021; Liu et al. 2020). Akata et al. (2015, 2016) proposed Attribute Label Embedding (ALE) to model visual-semantic embedding as a bilinear compatibility function between the image space and the attribute space. In (Zhang, Xiang, and Gong 2016), deep ZSL methods were presented to model the non-linear mapping between vision and class descriptions. In the context of ZSL from noisy textual descriptions, an early linear approach for Wikipedia-based ZSL was proposed in (Elhoseiny, Saleh, and Elgammal 2013). Orthogonal to these improvements, generative models like GANs (Goodfellow et al. 2014) and VAEs (Kingma and Welling 2013) have been adopted to formulate multi-modality in zero-shot recognition by synthesizing visual features of unseen classes given their semantic descriptions, e.g., (Kumar Verma et al. 2018; Zhu et al. 2018; Schonfeld et al. 2019; Narayan et al. 2020; Han et al. 2021; Chen et al. 2021). Zhu et al. (2018) introduced a GAN model with a classification head alongside the standard real/fake head to improve text-based ZSL. Schonfeld et al. (2019) proposed cross- and distribution-aligned VAEs to better leverage the seen and unseen relationships. Han et al. (2021) utilized a generative network along with a multi-level supervised contrastive embedding strategy to learn image and semantic relationships. Our GRaWD loss helps improve the out-of-distribution performance of generative ZSL models.
Approach
We start this section with the formulation of our Generative Random Walk Deviation loss. We will show later in this section how it can be integrated with both generative ZSL models to improve unseen class recognition and with state-of-the-art deep GAN models to encourage novel visual generations. We denote the generator as $G$ and its corresponding parameters as $\theta_G$. As in (Xian et al. 2018b; Zhu et al. 2018; Elhoseiny and Elfeki 2019; Felix et al. 2018), the semantic representation $s_k$ can be concatenated with a random vector $z$ sampled from a Gaussian distribution to generate an image for visual art generation or visual features in the case of zero-shot learning. Hence, $G(s_k, z)$ is the generated image/feature from the semantic description of class $k$ and the noise vector $z$. We denote the discriminator as $D$ and its corresponding parameters as $\theta_D$. The discriminator is trained with two objectives: (1) predict real for images from the training set and fake for generated ones, and (2) identify the category of the input image. The discriminator therefore has two heads. The first head is a binary real/fake classifier. The second head is a $K_s$-way classifier over the seen classes. We denote the real/fake probability produced by $D$ for an input image $x$ as $D^r(x)$, and the classification score of seen class $k$ given the image as $D^{s,k}(x)$.
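To make this setup concrete, the following is a minimal PyTorch sketch of a conditional generator and a two-headed discriminator. It is our own illustrative implementation: layer sizes, hidden dimensions, and names such as `feat_dim` are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """G(s, z): concatenate a semantic description s with noise z and map
    them to a visual feature (or image) vector."""
    def __init__(self, sem_dim, noise_dim, feat_dim, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sem_dim + noise_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, feat_dim), nn.ReLU())

    def forward(self, s, z):
        return self.net(torch.cat([s, z], dim=1))

class TwoHeadDiscriminator(nn.Module):
    """D(x): a shared trunk followed by a real/fake head D^r and a
    K_s-way seen-class classification head D^{s,k}."""
    def __init__(self, feat_dim, num_seen_classes, hidden=1024):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, hidden), nn.LeakyReLU(0.2))
        self.real_fake = nn.Linear(hidden, 1)                   # D^r: real/fake score
        self.classifier = nn.Linear(hidden, num_seen_classes)   # D^{s,k}: seen-class logits

    def forward(self, x):
        h = self.trunk(x)   # last-layer activations, reused below as the feature map psi
        return self.real_fake(h), self.classifier(h), h
```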
Generative Random Walk Deviation Loss
We sample $N_g$ examples with the generator $G$ that we aim to deviate from the seen classes/styles in the current minibatch. We denote the features of these hallucinated generations as $X_g = \{x^g_1, \ldots, x^g_{N_g}\}$. These features are extracted by $\psi(\cdot)$, a feature extraction function that we define as the activations of the last layer of the discriminator followed by scaled L2 normalization. The scale factor mainly amplifies the norm of the vectors to avoid the vanishing gradient problem, inspired by (Bell et al. 2016). We set the scale guided by (Bell et al. 2016; Zhang et al. 2019). We denote the seen class centers that we aim to deviate from as $C = \{c_1, \ldots, c_{K_s}\}$, defined in the same feature space as $X_g$, where $c_k$ represents the center of seen class/style $k$. The formulation of $c_k$ depends on the application (e.g., zero-shot learning or novel art generation) and is defined later in this section.
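A minimal sketch of the feature map $\psi$ (scaled L2 normalization of the discriminator's last-layer activations); the scale value used below is a placeholder, since the exact value is chosen following (Bell et al. 2016; Zhang et al. 2019).

```python
import torch

def psi(disc_hidden, scale=3.0):
    """Scaled L2 normalization: normalize each activation vector to unit norm,
    then amplify it by a constant scale to avoid vanishing gradients."""
    return scale * disc_hidden / disc_hidden.norm(dim=1, keepdim=True).clamp_min(1e-8)
```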
Let $A \in \mathbb{R}^{K_s \times N_g}$ be the similarity matrix between the seen class centers ($C$) and the features of the generations ($X_g$). Similarly, let $B \in \mathbb{R}^{N_g \times N_g}$ be the similarity matrix between the generated points. In particular, we use the negative Euclidean distances between the embeddings as a similarity measure as follows:
$$A_{ij} = -\left\| c_i - x^g_j \right\|_2^2, \qquad B_{ij} = -\left\| x^g_i - x^g_j \right\|_2^2 \qquad (1)$$
where $x^g_i$ and $x^g_j$ are the $i$-th and $j$-th features in the set $X_g$; see Fig. 3. To avoid self-cycles, the diagonal entries of $B$ are set to a small value. Hence, we define three transition probability matrices:
$$T^{c\to g} = \sigma(A), \qquad T^{g\to c} = \sigma(A^\top), \qquad T^{g\to g} = \sigma(B) \qquad (2)$$
where $\sigma(\cdot)$ is the softmax operator applied over each row of the input matrix, and $T^{c\to g}$ and $T^{g\to c}$ are the transition probability matrices from each seen class over the generated points and vice versa, respectively. $T^{g\to g}$ is the transition probability matrix from each generated point over the other generated points. We hence define our generative random walker probability matrix as:
$$P^{(T)} = T^{c\to g} \cdot \left( T^{g\to g} \right)^{T} \cdot T^{g\to c} \qquad (3)$$
where $P^{(T)}_{ij}$ denotes the probability of ending a random walk of length $T$ at seen class $j$ given that we have started at seen class $i$; $T$ denotes the number of steps taken between the generated points before stepping back to land on a seen class/style.
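A sketch of Eqs. 1-3 in PyTorch, assuming `centers` is a $K_s \times d$ tensor of seen class/style centers and `gen` is an $N_g \times d$ tensor of generated features (both already mapped through $\psi$):

```python
import torch
import torch.nn.functional as F

def random_walk_matrix(centers, gen, T):
    """Return P^(T) (Eq. 3) and the center-to-generation transitions T^{c->g}."""
    A = -torch.cdist(centers, gen) ** 2      # Eq. 1: K_s x N_g similarities
    B = -torch.cdist(gen, gen) ** 2          # Eq. 1: N_g x N_g similarities
    B.fill_diagonal_(-1e8)                   # very low self-similarity -> no self-cycles
    T_cg = F.softmax(A, dim=1)               # Eq. 2: seen centers -> generations
    T_gc = F.softmax(A.t(), dim=1)           # Eq. 2: generations -> seen centers
    T_gg = F.softmax(B, dim=1)               # Eq. 2: generations -> generations
    P = T_cg @ torch.linalg.matrix_power(T_gg, T) @ T_gc   # Eq. 3: K_s x K_s
    return P, T_cg
```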
Loss. Our random walk loss aims at boosting the deviation of unseen visual spaces from seen classes. Hence, we define our loss by encouraging each row in $P^{(T)}$ to be hard to classify to seen classes as follows:
$$\mathcal{L}_{GRaWD} = \sum_{t=1}^{T} \gamma^{t}\, \mathcal{L}_{CE}\!\left( P^{(t)}, \mathcal{U}^{K_s} \right) \;+\; \mathcal{L}_{CE}\!\left( P_v, \mathcal{U}^{N_g} \right) \qquad (4)$$
where the first term minimizes the cross-entropy loss $\mathcal{L}_{CE}$ between every row in $P^{(t)}$ and the uniform distribution over seen classes $\mathcal{U}^{K_s} = \frac{1}{K_s}\mathbf{1}$; here $T$ is a hyperparameter and $\gamma$ is an exponential decay factor fixed in our experiments. The second term maximizes the probability that all the generations are equally visited by the random walk. Note that if we replaced $\mathcal{U}^{K_s}$ by an identity matrix to encourage landing at the starting seen class, the loss would become an attraction signal similar to (Haeusser, Mordvintsev, and Cremers 2017), which defines its conceptual difference to GRaWD. We call this version GRaWT, with T for aTtraction. The second term, called the 'visit loss', was proposed in (Haeusser, Mordvintsev, and Cremers 2017) to encourage the random walker to visit a large set of unlabeled points. We compute the overall probability $P_v$ that each generated point would be visited from any of the seen classes as $P_v = \frac{1}{K_s}\sum_{i} T^{c\to g}_{i}$, where $T^{c\to g}_{i}$ represents the $i$-th row of the matrix $T^{c\to g}$; see Fig. 3. The visit loss is then defined as the cross-entropy between $P_v$ and the uniform distribution $\mathcal{U}^{N_g} = \frac{1}{N_g}\mathbf{1}$. Hence, the visit loss encourages the walk to visit as many examples as possible from $X_g$ and hence improves learning representation.
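Continuing the sketch above, Eq. 4 can then be written as follows; the exponential decay value `gamma` and the summation over walk lengths follow the description in the text, but the exact weighting is an assumption (the released code may differ).

```python
import torch

def grawd_loss(centers, gen, T, gamma=0.7):
    """GRaWD (Eq. 4, sketch): a deviation term that pushes every row of P^(t)
    toward the uniform distribution over seen classes, plus a visit term that
    encourages the walk to visit all generated points equally."""
    num_seen = centers.size(0)
    loss, T_cg = 0.0, None
    for t in range(1, T + 1):
        P, T_cg = random_walk_matrix(centers, gen, t)
        # cross-entropy between each row of P^(t) and the uniform target 1/K_s
        deviation = -(torch.log(P + 1e-8).sum(dim=1) / num_seen).mean()
        loss = loss + (gamma ** t) * deviation
    # visit loss: overall probability that each generated point is visited
    p_visit = T_cg.mean(dim=0)                           # length N_g
    loss = loss + (-torch.log(p_visit + 1e-8)).mean()    # cross-entropy to uniform 1/N_g
    return loss
```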
GRaWD Integration with Generative ZSL
Let us denote the sets of seen and unseen class labels as $\mathcal{Y}^s$ and $\mathcal{Y}^u$, where $\mathcal{Y}^s \cap \mathcal{Y}^u = \emptyset$. We denote the semantic representations of unseen and seen classes as $s_u = \phi(t_u) \in \mathcal{S}$ and $s_k = \phi(t_k) \in \mathcal{S}$, respectively, where $\mathcal{S}$ is the semantic space and $\phi(\cdot)$ is the semantic description function that extracts features from the text article or attribute description $t$ of a class. Let us denote the seen data as $\mathcal{D}^s = \{(x_i, y_i)\}$, where $x_i$ denotes the visual features of the $i$-th image and $y_i \in \mathcal{Y}^s$ is the corresponding seen category label. For unseen classes, we are given only their semantic representations, one per class. We define $K_u$ as the number of unseen classes. In Generalized ZSL (GZSL), we aim to predict the label of a test image $x$ given that $x$ may belong to seen or unseen classes. We represent the seen classes by $C = \{c_1, \ldots, c_{K_s}\}$, where $c_k$ represents the center of class $k$, which we define as
$$c_k = \mathbb{E}_{z \sim \mathcal{N}(0, I)}\left[ \psi\big( G(\phi(t_k), z) \big) \right] \qquad (5)$$
where $t_k$ is the attribute or text description of seen class $k$. The generations $X_g$ are sampled as $x^g = \psi(G(s^u, z))$ with $z \sim \mathcal{N}(0, I)$, where $s^u$ is the semantic description of a hallucinated unseen class. We explicitly explore the unseen/creative space of the generator with a hallucinated semantic representation $s^u \sim p^u$, where $p^u$ is a probability distribution over hallucinated unseen descriptions, aimed to be likely hard negatives to the seen classes. We sample $s^u$ following the strategy proposed in (Elhoseiny and Elfeki 2019) due to its simplicity and effectiveness. It picks two seen semantic descriptions at random, $\phi(t_a)$ and $\phi(t_b)$. We then sample $s^u = \alpha\,\phi(t_a) + (1-\alpha)\,\phi(t_b)$, where $\alpha$ is uniformly sampled between $0$ and $1$. Note that values of $\alpha$ near $0$ or $1$ are discarded to avoid sampling semantic descriptions that are very close to the seen classes. We then integrate $\mathcal{L}_{GRaWD}$ with the generator loss as follows.
$$\begin{aligned} \mathcal{L}_G = \; & \lambda\, \mathcal{L}_{GRaWD} \;-\; \mathbb{E}_{z, k}\big[ D^r\big( G(\phi(t_k), z) \big) \big] \;-\; \mathbb{E}_{z,\, s^u \sim p^u}\big[ D^r\big( G(s^u, z) \big) \big] \\ & -\; \mathbb{E}_{z, k}\big[ \log D^{s,k}\big( G(\phi(t_k), z) \big) \big] \end{aligned} \qquad (6)$$
Here, the first term is the proposed GRaWD loss weighted by a coefficient $\lambda$. The second and third terms trick the generator into classifying the visual generations from both the seen semantic descriptions $\phi(t_k)$ and the hallucinated unseen semantic descriptions $s^u$ as real. The fourth term encourages the generator to discriminatively generate visual features conditioned on a given seen class description. We then define the discriminator loss as
$$\begin{aligned} \mathcal{L}_D = \; & \mathbb{E}_{z, k}\big[ D^r\big( G(\phi(t_k), z) \big) \big] + \mathbb{E}_{z,\, s^u \sim p^u}\big[ D^r\big( G(s^u, z) \big) \big] - \mathbb{E}_{(x, y) \sim p_{data}}\big[ D^r(x) \big] \\ & + \lambda_{GP}\, \mathbb{E}_{\hat{x}}\big[ \big( \| \nabla_{\hat{x}} D^r(\hat{x}) \|_2 - 1 \big)^2 \big] - \mathbb{E}_{(x, y) \sim p_{data}}\big[ \log D^{s,y}(x) \big] - \mathbb{E}_{z, k}\big[ \log D^{s,k}\big( G(\phi(t_k), z) \big) \big] \end{aligned} \qquad (7)$$
Here, an image $x$ and its corresponding class one-hot label $y$ are sampled from the data distribution $p_{data}$. $\phi(t_k)$ is the feature of a semantic description sampled from the seen classes, with $k$ its corresponding one-hot label. The first three terms approximate the Wasserstein distance between the distributions of real and fake features, and the fourth term is the gradient penalty that enforces the Lipschitz constraint $\|\nabla_{\hat{x}} D^r(\hat{x})\|_2 \le 1$, where $\hat{x}$ is a linear interpolation of the real feature $x$ and the fake feature $\tilde{x}$; see (Gulrajani et al. 2017). The last two terms are the classification losses of the real and generated data with respect to their corresponding classes.
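Putting the pieces together, here is a condensed sketch of how the hallucinated descriptions, the class centers of Eq. 5, and the generator objective of Eq. 6 can be assembled. It reuses `psi`, `random_walk_matrix`, and `grawd_loss` from the sketches above; details such as batch handling, the $\alpha$ range, and the number of noise samples per center are assumptions.

```python
import torch
import torch.nn.functional as F

def sample_hallucinated_desc(seen_sem, low=0.2, high=0.8):
    """Interpolate two distinct, randomly picked seen descriptions; alpha values
    near 0 or 1 are avoided so the sample is not too close to a seen class."""
    i, j = torch.randperm(seen_sem.size(0))[:2]
    alpha = torch.empty(1).uniform_(low, high)
    return alpha * seen_sem[i] + (1 - alpha) * seen_sem[j]

def seen_class_centers(G, D, seen_sem, noise_dim, n_z=32):
    """Eq. 5 (sketch): center of seen class k = mean psi-feature of generations
    conditioned on its semantic description."""
    centers = []
    for s_k in seen_sem:
        z = torch.randn(n_z, noise_dim)
        _, _, h = D(G(s_k.unsqueeze(0).expand(n_z, -1), z))
        centers.append(psi(h).mean(dim=0))
    return torch.stack(centers)

def generator_loss(G, D, seen_sem, seen_labels, noise_dim, T=10, lam=10.0):
    """Eq. 6 (sketch): GRaWD + real/fake terms + seen-class classification term."""
    z = torch.randn(seen_sem.size(0), noise_dim)
    fake_seen = G(seen_sem, z)                                   # from seen descriptions
    s_u = torch.stack([sample_hallucinated_desc(seen_sem) for _ in range(seen_sem.size(0))])
    fake_unseen = G(s_u, torch.randn_like(z))                    # from hallucinated descriptions

    rf_seen, cls_seen, _ = D(fake_seen)
    rf_unseen, _, h_unseen = D(fake_unseen)

    centers = seen_class_centers(G, D, seen_sem, noise_dim)
    l_grawd = grawd_loss(centers, psi(h_unseen), T)              # deviate from seen centers
    l_real = -rf_seen.mean() - rf_unseen.mean()                  # fool D^r on both kinds of fakes
    l_cls = F.cross_entropy(cls_seen, seen_labels)               # stay discriminative on seen classes
    return lam * l_grawd + l_real + l_cls
```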
GRaWD Integration with StyleGANs for Novel Art Generation
We integrated our loss with DCGAN (Radford, Metz, and Chintala 2015), StyleGAN (Karras, Laine, and Aila 2019a), and StyleGAN2 (Karras et al. 2020) by simply adding $\mathcal{L}_{GRaWD}$ in Eq. 4 to the generator loss. We assume we have $K_s$ seen art styles that we aim to deviate from. Here, we define the style centers $c_k$ by sampling a small episodic memory for every style and computing its representation from discriminator features. We randomly sample the examples per style once and compute their mean representation at each iteration. We provide more training details in the supplementary.
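For the art-generation setting, the style centers can be kept up to date from a small, fixed episodic memory of real images per style, re-encoded by the current discriminator at every iteration. A sketch, assuming the same two-headed discriminator interface as the earlier sketches (the memory size `m` is an assumption):

```python
import torch

def style_centers(D, memory, m=8):
    """memory: dict mapping style id -> tensor of a few real examples sampled once.
    Centers are recomputed from the current discriminator features each iteration,
    so they track the evolving feature space psi."""
    centers = []
    for imgs in memory.values():
        _, _, h = D(imgs[:m])              # last-layer activations for the memory samples
        centers.append(psi(h).mean(dim=0))
    return torch.stack(centers)
```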
| Setting | Method | CUB-Easy Top-1 Acc (%) | CUB-Easy SU-AUC (%) | CUB-Hard Top-1 Acc (%) | CUB-Hard SU-AUC (%) |
|---|---|---|---|---|---|
| Deviation losses on GAZSL (Zhu et al. 2018) | + GRaWT (T=0) | 44.0 | 39.5 | 13.7 | 11.8 |
| | + GRaWT (T=3) | 43.4 | 38.8 | 13.2 | 11.4 |
| | + Classify as extra class | 43.2 | 38.3 | 11.31 | 9.5 |
| | + CIZSL (Elhoseiny and Elfeki 2019) | 44.6 | 39.2 | 14.4 | 11.9 |
| Walk length on GAZSL (Zhu et al. 2018) | + GRaWD (T=1) | 45.41 | 39.62 | 13.79 | 12.58 |
| | + GRaWD (T=3) | 45.11 | 39.25 | 14.21 | 13.22 |
| | + GRaWD (T=5) | 45.40 | 40.51 | 14.00 | 13.07 |
| | + GRaWD (T=10) | 45.43 | 40.68 | 15.51 | 13.70 |
| Method | CUB Easy Top-1 (%) | CUB Hard Top-1 (%) | NAB Easy Top-1 (%) | NAB Hard Top-1 (%) | CUB Easy SU-AUC (%) | CUB Hard SU-AUC (%) | NAB Easy SU-AUC (%) | NAB Hard SU-AUC (%) |
|---|---|---|---|---|---|---|---|---|
| ZSLNS (Qiao et al. 2016) | 29.1 | 7.3 | 24.5 | 6.8 | 14.7 | 4.4 | 9.3 | 2.3 |
| SynCfast (Changpinyo et al. 2016) | 28.0 | 8.6 | 18.4 | 3.8 | 13.1 | 4.0 | 2.7 | 3.5 |
| ZSLPP (Elhoseiny et al. 2017) | 37.2 | 9.7 | 30.3 | 8.1 | 30.4 | 6.1 | 12.6 | 3.5 |
| FeatGen (Xian et al. 2018b) | 43.9 | 9.8 | 36.2 | 8.7 | 34.1 | 7.4 | 21.3 | 5.6 |
| LsrGAN (tr) (Vyas et al. 2020) | 45.2 | 14.2 | 36.4 | 9.0 | 39.5 | 12.1 | 23.2 | 6.4 |
| + GRaWD | 45.6 (+0.4) | 15.1 (+0.9) | 37.8 (+1.4) | 9.7 (+0.7) | 39.9 (+0.4) | 13.3 (+1.2) | 24.5 (+1.3) | 6.7 (+0.3) |
| GAZSL (Zhu et al. 2018) | 43.7 | 10.3 | 35.6 | 8.6 | 35.4 | 8.7 | 20.4 | 5.8 |
| + CIZSL (Elhoseiny and Elfeki 2019) | 44.6 | 14.4 | 36.6 | 9.3 | 39.2 | 11.9 | 24.5 | 6.4 |
| + GRaWD | 45.4 (+1.7) | 15.5 (+5.2) | 38.4 (+2.8) | 10.1 (+1.5) | 40.7 (+5.3) | 13.7 (+5.0) | 25.8 (+5.4) | 7.4 (+1.6) |
| Method | AwA2 H | AwA2 S | AwA2 U | aPY H | aPY S | aPY U | SUN H | SUN S | SUN U |
|---|---|---|---|---|---|---|---|---|---|
| GRaWT (T=0) | 32.3 | 80.5 | 20.2 | 23.0 | 78.9 | 13.4 | 26.0 | 31.6 | 22.2 |
| GRaWT (T=3) | 31.6 | 80.7 | 19.7 | 22.4 | 75.8 | 13.1 | 25.8 | 31.1 | 22.1 |
| GRaWD | 39.0 | 88.3 | 25.0 | 27.2 | 83.2 | 16.3 | 27.9 | 37.3 | 22.3 |

(H: seen-unseen harmonic mean, S: seen accuracy, U: unseen accuracy, all in %)
| Method | AwA2 Top-1 (%) | aPY Top-1 (%) | SUN Top-1 (%) | AwA2 Seen-Unseen H | aPY Seen-Unseen H | SUN Seen-Unseen H |
|---|---|---|---|---|---|---|
| SJE (Akata et al. 2015) | 61.9 | 35.2 | 53.7 | 14.4 | 6.9 | 19.8 |
| LATEM (Xian et al. 2016) | 55.8 | 35.2 | 55.3 | 20.0 | 0.2 | 19.5 |
| ALE (Akata et al. 2016) | 62.5 | 39.7 | 58.1 | 23.9 | 8.7 | 26.3 |
| SYNC (Changpinyo et al. 2016) | 46.6 | 23.9 | 56.3 | 18.0 | 13.3 | 13.4 |
| SAE (Kodirov, Xiang, and Gong 2017) | 54.1 | 8.3 | 40.3 | 2.2 | 0.9 | 11.8 |
| DEM (Zhang, Xiang, and Gong 2016) | 67.1 | 35.0 | 61.9 | 25.1 | 19.4 | 25.6 |
| FeatGen (Xian et al. 2018b) | 54.3 | 42.6 | 60.8 | 17.6 | 21.4 | 24.9 |
| cycle-(U)WGAN (Felix et al. 2018) | 56.2 | 44.6 | 60.3 | 19.2 | 23.6 | 24.4 |
| LsrGAN (tr) (Vyas et al. 2020) | 60.1 | 34.6 | 62.5 | 48.7 | 31.5 | 44.8 |
| + GRaWD | 63.7 (+3.6) | 35.5 (+0.9) | 64.2 (+1.7) | 49.2 (+0.5) | 32.7 (+1.2) | 46.1 (+1.3) |
| GAZSL (Zhu et al. 2018) | 58.9 | 41.1 | 61.3 | 15.4 | 24.0 | 26.7 |
| + CIZSL (Elhoseiny and Elfeki 2019) | 67.8 | 42.1 | 63.7 | 24.6 | 25.7 | 27.8 |
| + GRaWD | 68.4 (+9.5) | 43.3 (+2.2) | 62.1 (+0.8) | 39.0 (+23.6) | 27.2 (+3.2) | 27.9 (+1.2) |
Experiments
Purely Inductive Generative ZSL Experiments
Purely Inductive Evaluation in ZSL: Our focus in this paper is to learn a good representation of unseen visual spaces without accessing any unseen class information during training. However, most recent papers jointly train an extra classifier (e.g., an MLP) with their proposed generative model (Narayan et al. 2020; Han et al. 2021). More concretely, this classifier is trained on features generated from the unseen semantic descriptions; in GZSL, seen training images, along with these generated features, are introduced as input to the extra classifier. We refer to methods that assume access to unseen class descriptions during training as semantic transductive ZSL (even if they do not use unlabeled images of unseen classes). Accessing unseen information before evaluation is not in line with our focus on learning generative unseen representations, and it is also less realistic if we aim at purely inductive zero-shot learning. Following purely inductive ZSL settings (e.g., (Zhu et al. 2018; Elhoseiny and Elfeki 2019)), we use NN classification on the generated features for evaluation, which avoids accessing any unseen semantic information before testing.
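A sketch of this purely inductive evaluation protocol: unseen-class features are synthesized only at test time, and each test image is labeled by nearest neighbor against the synthesized features (here averaged into one prototype per unseen class; the number of samples per class is an assumption).

```python
import torch

def nn_zsl_predict(G, unseen_sem, test_feats, noise_dim, n_per_class=60):
    """Generate features for every unseen class from its semantic description,
    average them into a prototype, and classify test features by nearest prototype."""
    prototypes = []
    for s in unseen_sem:                                   # one semantic vector per unseen class
        z = torch.randn(n_per_class, noise_dim)
        prototypes.append(G(s.unsqueeze(0).expand(n_per_class, -1), z).mean(dim=0))
    prototypes = torch.stack(prototypes)                   # K_u x feat_dim
    return torch.cdist(test_feats, prototypes).argmin(dim=1)   # predicted unseen class index
```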
We performed experiments on existing ZSL benchmarks with text descriptions and attributes as semantic class descriptions. Note that the text description setting is more challenging because the descriptions are at the class level and are extracted from Wikipedia, which is noisier. We found the number of random walk steps $T$ easy to tune using the validation set.
Text-Based ZSL. We performed our text-based ZSL experiments on Caltech UCSD Birds-2011 (CUB) (Wah et al. 2011), containing 200 classes with 11,788 images, and North America Birds (NAB) (Van Horn et al. 2015), which has 1011 classes with 48,562 images. We use two metrics widely used in evaluating ZSL recognition performance: standard zero-shot recognition with Top-1 unseen class accuracy, and Seen-Unseen generalized zero-shot performance with the Area Under the Seen-Unseen Curve (Chao et al. 2016). We follow (Chao et al. 2016; Zhu et al. 2018; Elhoseiny and Elfeki 2019) in using the Area Under the SUC to evaluate the generalization capability of class-level text zero-shot recognition on four splits (CUB Easy, CUB Hard, NAB Easy, and NAB Hard). The hard (SCE) splits are constructed such that unseen bird classes come from super-categories that do not overlap with seen classes. Our proposed loss function improves over prior methods on all datasets on both the easy and hard splits, as shown in Table 4. We show improvements in the range of 0.8-1.8% in Top-1 accuracy and 1-1.8% in AUC. From Table 4, GAZSL (Zhu et al. 2018)+GRaWD has an average relative Seen-Unseen AUC improvement of 9.29% over GAZSL (Zhu et al. 2018)+CIZSL (Elhoseiny and Elfeki 2019) and 30.89% over GAZSL (Zhu et al. 2018) alone. We achieved state-of-the-art results on the text datasets. In Table 4, we performed an ablation study showing that longer random walks performed better, giving higher accuracies and AUC scores for both the easy and hard splits of the CUB dataset. With longer walks, the model can take a more holistic view of the generated visual representations in a way that enables better deviation of unseen classes from seen classes. Therefore, we used T=10 for our experiments.
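For reference, a sketch of the Area Under the Seen-Unseen Curve metric of Chao et al. (2016): seen-class scores are penalized by a calibration constant that is swept over a range, and the area under the resulting unseen-vs-seen accuracy curve is reported (per-sample accuracy is used here for brevity).

```python
import numpy as np

def seen_unseen_auc(scores_seen, y_seen, scores_unseen, y_unseen, seen_mask, gammas):
    """scores_*: class scores over the joint seen+unseen label space (N x K arrays).
    seen_mask: boolean vector of length K marking seen-class columns."""
    points = []
    for g in gammas:                                   # calibrated stacking: penalize seen scores
        acc_s = (np.argmax(scores_seen - g * seen_mask, axis=1) == y_seen).mean()
        acc_u = (np.argmax(scores_unseen - g * seen_mask, axis=1) == y_unseen).mean()
        points.append((acc_s, acc_u))
    points = np.array(sorted(points))                  # order points by seen accuracy
    return np.trapz(points[:, 1], points[:, 0])        # area under A_U as a function of A_S
```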
Attribute-Based ZSL. We performed experiments on the widely used GBU (Xian et al. 2018a) setup, where we use class attributes as semantic descriptors. We performed these experiments on the AwA2 (Lampert, Nickisch, and Harmeling 2009b), aPY (Farhadi et al. 2009), and SUN (Patterson and Hays 2012) datasets. In Table 4, we see that GRaWD outperforms all of the existing methods on the seen-unseen harmonic mean for the AwA2, aPY, and SUN datasets. On the AwA2 dataset, it outperforms the compared methods by a significant margin, i.e., 15.1%. It is also competitive with existing methods in Top-1 accuracy while improving by 4.8% on AwA2. From Table 4, GAZSL (Zhu et al. 2018)+GRaWD has an average relative improvement in harmonic mean of 24.92% over GAZSL (Zhu et al. 2018)+CIZSL (Elhoseiny and Elfeki 2019) and 61.35% over GAZSL (Zhu et al. 2018).
Tables 4 and 4 show that the deviation signal in GRaWD is critical to achieving better performance, since the reported metrics are much better for GRaWD than for GRaWT on both text-based and attribute-based ZSL; the performance can severely degrade without the deviation signal. Table 4 (bottom section) shows that longer walk lengths benefit training, as the model is encouraged to globally explore larger segments of the unseen representation manifold.
GRaWD Loss for Transductive ZSL. We also apply our GRaWD loss to the transductive ZSL setting and choose LsrGAN (Vyas, Venkateswara, and Panchanathan 2020) as the baseline model. Our loss improves LsrGAN on both text-based and attribute-based datasets on most metrics, with gains ranging from 0.3% to 3.6%. Although our loss does not use unseen class descriptors, it still improves LsrGAN (transductive) on average by 1.96% on attribute datasets and 2.91% on text-based datasets. However, in line with our expectations, the improvement in the purely inductive setting is more significant.
Novel Art Generation Experiments
We performed our Art experiments on the WikiArt dataset, containing 81k images of 27 different styles (WikiArt 2015).
| Loss | Architecture | Likeability Q1 mean (std) | NN (high dist.) | NN (low dist.) | Entropy | Random | Turing Test Q2 (% Artist) |
|---|---|---|---|---|---|---|---|
| CAN (Elgammal et al. 2017b) | DCGAN | 3.20 (1.50) | - | - | - | - | 53 |
| GAN (Vanilla) | StyleGAN | 3.12 (0.58) | 3.07 | 3.36 | 3.00 | 3.06 | 55.33 |
| CAN | StyleGAN | 3.20 (0.62) | 3.01 | 3.61 | 3.05 | 3.11 | 56.55 |
| RW-T3 (Ours) | StyleGAN | 3.29 (0.59) | 3.05 | 3.58 | 3.13 | 3.38 | 54.08 |
| RW-T10 (Ours) | StyleGAN | 3.29 (0.63) | 3.15 | 3.67 | 3.15 | 3.17 | 58.63 |
| GAN (Vanilla) | StyleGAN2 | 3.02 (1.15) | 2.89 | 3.30 | 2.79 | 3.09 | 54.01 |
| CAN | StyleGAN2 | 3.23 (1.16) | 3.27 | 3.34 | 3.11 | 3.21 | 57.9 |
| RW-T3 (Ours) | StyleGAN2 | 3.40 (1.1) | 3.30 | 3.61 | 3.33 | 3.35 | 64.0 |
| Architecture | Methods compared | Normalized mean ranks (lower is better) |
|---|---|---|
| StyleGAN1 | CAN / RW-T10 | 0.53 / 0.47 |
| StyleGAN1 | CAN / RW-T3 | 0.53 / 0.47 |
| StyleGAN1 | CAN / RW-T10 / RW-T3 | 0.52 / 0.48 / 0.50 |
| StyleGAN2 | CAN / RW-T3 | 0.54 / 0.46 |
| StyleGAN2 | GAN / RW-T3 | 0.59 / 0.41 |
| StyleGAN2 | CAN / GAN / RW-T3 | 0.49 / 0.59 / 0.42 |
Baselines. We performed comparisons with two baselines: (1) the vanilla GAN for the chosen architecture, and (2) adding the Holistic-CAN loss (Sbai et al. 2018) (an improved version of CAN (Elgammal et al. 2017b)). For simplicity, we refer to Holistic-CAN as CAN.
Nomenclature. Here, the models are referred to as RW-T(value), where RW denotes the GRaWD loss and T is the number of random walk steps. We name our models according to this convention throughout. We perform human subject experiments to evaluate the generated art. We set the value of the GRaWD loss coefficient to 10. We divide the generations from these models into four groups, each containing 100 images; see examples in Fig. 4.
- NN (high distance). Images with high nearest neighbor (NN) distance from the training dataset.
- NN (low distance). Images with low NN distance from the training dataset.
- Entropy. Images with high entropy of the probabilities from a style classifier.
- Random (R). A set of random images.
For example, we denote generations using our GRaWD loss with T=10 and an NN group as RW-T10_NN. A sketch of how these groups can be selected is shown below. Fig. 4 shows the top liked/disliked paintings according to human evaluation.
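A sketch of the group selection, assuming `gen_feats` and `train_feats` are feature matrices from some embedding and `style_probs` are the per-image probabilities of a style classifier (all names are illustrative):

```python
import torch

def select_groups(gen_feats, train_feats, style_probs, k=100):
    """Rank generations for the four evaluation groups: high / low nearest-neighbor
    distance to the training set, high style-classifier entropy, and random."""
    nn_dist = torch.cdist(gen_feats, train_feats).min(dim=1).values
    entropy = -(style_probs * torch.log(style_probs + 1e-8)).sum(dim=1)
    nn_high = nn_dist.topk(k).indices                  # most novel w.r.t. training data
    nn_low = (-nn_dist).topk(k).indices                # closest to training data
    high_entropy = entropy.topk(k).indices             # style-ambiguous generations
    random_set = torch.randperm(gen_feats.size(0))[:k]
    return nn_high, nn_low, high_entropy, random_set
```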


Human Evaluation. We performed our human subject MTurk experiments based on the StyleGAN1 (Karras, Laine, and Aila 2019b) and StyleGAN2 (Karras et al. 2020) architectures' vanilla, CAN, and GRaWD variants. We divide the generations into the four groups described above. We collect five responses for each art piece (400 images), totaling 2000 responses per model from 341 unique workers. In the first question (Q1), we asked people to rate generations from 1 (extremely dislike) to 5 (extremely like). In Q2, we asked whether the image was generated by a computer or an artist (Turing test). More setup details are in the supplementary. We found that art from StyleGAN1 and StyleGAN2 trained with our loss was more likable, and more people believed it to be real art, as shown in Table 5. For StyleGAN1, adding the GRaWD loss resulted in 38% and 18% more people giving a full rating of 5 compared to vanilla StyleGAN1 and StyleGAN1 + CAN loss, respectively; see Fig. 5. For StyleGAN2, these improvements were 65% and 15%. Table 6 shows that images from the StyleGAN models trained with our loss obtain a better rank when compared against the other sets in the table.


Wundt Curve Analysis. We approximate the Wundt curve (Packard 1975; Wundt 1874) by fitting a degree-3 polynomial to a scatter plot of normalized likeability vs. mean NN distance (a novelty measure). Generations are more likable if the deviation from existing art is moderate but not too large; see Fig. 6. Compared to CAN and GAN, our loss achieves, on balance, novel images that are more preferred.
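A minimal sketch of this fit, assuming `novelty` (mean NN distance) and `likeability` (normalized mean rating) are per-image NumPy arrays:

```python
import numpy as np

def wundt_fit(novelty, likeability, deg=3):
    """Fit a cubic to (novelty, likeability) points and return the fitted curve
    together with the novelty level at which predicted likeability peaks."""
    curve = np.poly1d(np.polyfit(novelty, likeability, deg))
    grid = np.linspace(novelty.min(), novelty.max(), 200)
    return curve, grid[curve(grid).argmax()]
```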
Emotion Experiments. Evaluators selected the emotion they felt after looking at each artwork and justified their chosen emotion in text. We collected 5 responses each for a set of 600 generated artworks from 260 unique workers. Fig. 7 shows the distribution over the selected emotions, which are diverse but mostly positive. However, some generations evoke negative emotions like fear. Fig. 7 also shows the most frequent words for each emotion after removing stop words. Notable positive words include "funny", "beautiful", and "attractive", and negative words include "dark" and "ghostly", which are associated with feelings like fear and disgust.
Conclusion
We propose the Generative Random Walk Deviation (GRaWD) loss and show that it improves generative models' capability to better understand unseen classes on several zero-shot learning benchmarks and to generate novel visual content when trained on the WikiArt dataset. We attribute the improvement to the global nature of our learning mechanism, which operates at the minibatch level, producing generations that pass messages to each other and thereby facilitate better deviation of unseen classes/styles from seen ones.
References
- Akata et al. (2016) Akata, Z.; Perronnin, F.; Harchaoui, Z.; and Schmid, C. 2016. Label-embedding for image classification. PAMI, 38(7): 1425–1438.
- Akata et al. (2015) Akata, Z.; Reed, S.; Walter, D.; Lee, H.; and Schiele, B. 2015. Evaluation of output embeddings for fine-grained image classification. In CVPR.
- Ayyad et al. (2020) Ayyad, A.; Navab, N.; Elhoseiny, M.; and Albarqouni, S. 2020. Semi-Supervised Few-Shot Learning with Prototypical Random Walks.
- Bell et al. (2016) Bell, S.; Lawrence Zitnick, C.; Bala, K.; and Girshick, R. 2016. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2874–2883.
- Briot, Hadjeres, and Pachet (2017) Briot, J.-P.; Hadjeres, G.; and Pachet, F. 2017. Deep Learning Techniques for Music Generation-A Survey. arXiv:1709.01620.
- Changpinyo et al. (2016) Changpinyo, S.; Chao, W.-L.; Gong, B.; and Sha, F. 2016. Synthesized classifiers for zero-shot learning. In CVPR, 5327–5336.
- Chao et al. (2016) Chao, W.-L.; Changpinyo, S.; Gong, B.; and Sha, F. 2016. An Empirical Study and Analysis of Generalized Zero-Shot Learning for Object Recognition in the Wild. In ECCV, 52–68. Springer.
- Chen et al. (2021) Chen, S.; Wang, W.; Xia, B.; Peng, Q.; You, X.; Zheng, F.; and Shao, L. 2021. FREE: Feature Refinement for Generalized Zero-Shot Learning. arXiv preprint arXiv:2107.13807.
- Date, Ganesan, and Oates (2017) Date, P.; Ganesan, A.; and Oates, T. 2017. Fashioning with Networks: Neural Style Transfer to Design Clothes. In KDD ML4Fashion workshop.
- DiPaola and Gabora (2009) DiPaola, S.; and Gabora, L. 2009. Incorporating characteristics of human creativity into an evolutionary art algorithm. Genetic Programming and Evolvable Machines, 10(2): 97–110.
- Dumoulin et al. (2017) Dumoulin, V.; Shlens, J.; and Kudlur, M. 2017. A learned representation for artistic style. ICLR.
- Elgammal et al. (2017a) Elgammal, A.; Liu, B.; Elhoseiny, M.; and Mazzone, M. 2017a. CAN: Creative adversarial networks, generating "art" by learning about styles and deviating from style norms. arXiv preprint arXiv:1706.07068.
- Elgammal et al. (2017b) Elgammal, A.; Liu, B.; Elhoseiny, M.; and Mazzone, M. 2017b. CAN: Creative adversarial networks, generating "art" by learning about styles and deviating from style norms. In International Conference on Computational Creativity.
- Elhoseiny and Elfeki (2019) Elhoseiny, M.; and Elfeki, M. 2019. Creativity Inspired Zero-Shot Learning. In Proceedings of the IEEE International Conference on Computer Vision, 5784–5793.
- Elhoseiny, Saleh, and Elgammal (2013) Elhoseiny, M.; Saleh, B.; and Elgammal, A. 2013. Write a classifier: Zero-shot learning using purely textual descriptions. In ICCV.
- Elhoseiny et al. (2017) Elhoseiny, M.; Zhu, Y.; Zhang, H.; and Elgammal, A. 2017. Link the Head to the ”Beak”: Zero Shot Learning From Noisy Text Description at Part Precision. In CVPR.
- Farhadi et al. (2009) Farhadi, A.; Endres, I.; Hoiem, D.; and Forsyth, D. 2009. Describing objects by their attributes. In CVPR 2009., 1778–1785. IEEE.
- Felix et al. (2018) Felix, R.; Kumar, V. B.; Reid, I.; and Carneiro, G. 2018. Multi-modal cycle-consistent generalized zero-shot learning. In ECCV, 21–37.
- Frome et al. (2013) Frome, A.; Corrado, G. S.; Shlens, J.; Bengio, S.; Dean, J.; Mikolov, T.; et al. 2013. Devise: A deep visual-semantic embedding model. In NIPS, 2121–2129.
- Gatys, Ecker, and Bethge (2016) Gatys, L. A.; Ecker, A. S.; and Bethge, M. 2016. Image Style Transfer Using Convolutional Neural Networks. In CVPR.
- Goodfellow et al. (2014) Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In NIPS, 2672–2680.
- Gulrajani et al. (2017) Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; and Courville, A. 2017. Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028.
- Guo et al. (2017a) Guo, Y.; Ding, G.; Han, J.; and Gao, Y. 2017a. Synthesizing Samples for Zero-shot Learning. In IJCAI.
- Guo et al. (2017b) Guo, Y.; Ding, G.; Han, J.; and Gao, Y. 2017b. Zero-shot Learning with Transferred Samples. IEEE Transactions on Image Processing.
- Ha and Eck (2018) Ha, D.; and Eck, D. 2018. A Neural Representation of Sketch Drawings. ICLR.
- Haeusser, Mordvintsev, and Cremers (2017) Haeusser, P.; Mordvintsev, A.; and Cremers, D. 2017. Learning by Association — A Versatile Semi-Supervised Training Method for Neural Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 626–635.
- Han et al. (2021) Han, Z.; Fu, Z.; Chen, S.; and Yang, J. 2021. Contrastive Embedding for Generalized Zero-Shot Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2371–2381.
- Hertzmann (2018) Hertzmann, A. 2018. Can computers create art? In Arts, volume 7, 18. Multidisciplinary Digital Publishing Institute.
- Isola et al. (2017) Isola, P.; Zhu, J.; Zhou, T.; and Efros, A. A. 2017. Image-to-Image Translation with Conditional Adversarial Networks. CVPR.
- Johnson, Alahi, and Li (2016) Johnson, J.; Alahi, A.; and Li, F. 2016. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. ECCV.
- Karras et al. (2018) Karras, T.; Aila, T.; Laine, S.; and Lehtinen, J. 2018. Progressive Growing of GANs for Improved Quality, Stability, and Variation. ICLR.
- Karras, Laine, and Aila (2019a) Karras, T.; Laine, S.; and Aila, T. 2019a. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4401–4410.
- Karras, Laine, and Aila (2019b) Karras, T.; Laine, S.; and Aila, T. 2019b. A Style-Based Generator Architecture for Generative Adversarial Networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Karras et al. (2020) Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; and Aila, T. 2020. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8110–8119.
- Kingma and Welling (2013) Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
- Kodirov, Xiang, and Gong (2017) Kodirov, E.; Xiang, T.; and Gong, S. 2017. Semantic autoencoder for zero-shot learning. arXiv preprint arXiv:1704.08345.
- Kumar Verma et al. (2018) Kumar Verma, V.; Arora, G.; Mishra, A.; and Rai, P. 2018. Generalized zero-shot learning via synthesized examples. In CVPR.
- Lampert, Nickisch, and Harmeling (2009a) Lampert, C. H.; Nickisch, H.; and Harmeling, S. 2009a. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 951–958. IEEE.
- Lampert, Nickisch, and Harmeling (2009b) Lampert, C. H.; Nickisch, H.; and Harmeling, S. 2009b. Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 951–958. IEEE.
- Lampert, Nickisch, and Harmeling (2013) Lampert, C. H.; Nickisch, H.; and Harmeling, S. 2013. Attribute-based classification for zero-shot visual object categorization. IEEE transactions on pattern analysis and machine intelligence, 36(3): 453–465.
- Li et al. (2019) Li, X.; Sun, Q.; Liu, Y.; Zhou, Q.; Zheng, S.; Chua, T.-S.; and Schiele, B. 2019. Learning to self-train for semi-supervised few-shot classification. In Advances in Neural Information Processing Systems, 10276–10286.
- Liu et al. (2020) Liu, S.; Chen, J.; Pan, L.; Ngo, C.-W.; Chua, T.-S.; and Jiang, Y.-G. 2020. Hyperbolic visual embedding learning for zero-shot recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9273–9281.
- Long et al. (2017) Long, Y.; Liu, L.; Shao, L.; Shen, F.; Ding, G.; and Han, J. 2017. From Zero-shot Learning to Conventional Supervised Classification: Unseen Visual Data Synthesis. In CVPR.
- Machado and Cardoso (2000) Machado, P.; and Cardoso, A. 2000. NEvAr–the assessment of an evolutionary art tool. In Proc. of the AISB00 Symposium on Creative & Cultural Aspects and Applications of AI & Cognitive Science, volume 456.
- Mirza and Osindero (2014) Mirza, M.; and Osindero, S. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
- Mordvintsev, Olah, and Tyka (2015) Mordvintsev, A.; Olah, C.; and Tyka, M. 2015. Inceptionism: Going deeper into neural networks. Google Research Blog. Retrieved June.
- Narayan et al. (2020) Narayan, S.; Gupta, A.; Khan, F. S.; Snoek, C. G.; and Shao, L. 2020. Latent embedding feedback and discriminative features for zero-shot classification. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16, 479–495. Springer.
- Nobari, Rashad, and Ahmed (2021) Nobari, A. H.; Rashad, M. F.; and Ahmed, F. 2021. Creativegan: editing generative adversarial networks for creative design synthesis. arXiv preprint arXiv:2103.06242.
- Odena, Olah, and Shlens (2017) Odena, A.; Olah, C.; and Shlens, J. 2017. Conditional image synthesis with auxiliary classifier gans. In ICML.
- Packard (1975) Packard, S. 1975. Aesthetics and Psychobiology by DE Berlyne. Leonardo, 8(3): 258–259.
- Patterson and Hays (2012) Patterson, G.; and Hays, J. 2012. Sun attribute database: Discovering, annotating, and recognizing scene attributes. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 2751–2758. IEEE.
- Qiao et al. (2016) Qiao, R.; Liu, L.; Shen, C.; and v. d. Hengel, A. 2016. Less is More: Zero-Shot Learning from Online Textual Documents with Noise Suppression. In CVPR.
- Radford, Metz, and Chintala (2015) Radford, A.; Metz, L.; and Chintala, S. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
- Radford, Metz, and Chintala (2016) Radford, A.; Metz, L.; and Chintala, S. 2016. Unsupervised representation learning with deep convolutional generative adversarial networks. ICLR.
- Reed et al. (2016) Reed, S. E.; Akata, Z.; Mohan, S.; Tenka, S.; Schiele, B.; and Lee, H. 2016. Learning What and Where to Draw. In NIPS.
- Ren et al. (2018) Ren, M.; Ravi, S.; Triantafillou, E.; Snell, J.; Swersky, K.; Tenenbaum, J. B.; Larochelle, H.; and Zemel, R. S. 2018. Meta-Learning for Semi-Supervised Few-Shot Classification. In International Conference on Learning Representations.
- Sbai et al. (2018) Sbai, O.; Elhoseiny, M.; Bordes, A.; LeCun, Y.; and Couprie, C. 2018. DeSIGN: Design Inspiration from Generative Networks. In ECCV workshop.
- Schonfeld et al. (2019) Schonfeld, E.; Ebrahimi, S.; Sinha, S.; Darrell, T.; and Akata, Z. 2019. Generalized zero-and few-shot learning via aligned variational autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8247–8255.
- Skorokhodov and Elhoseiny (2021) Skorokhodov, I.; and Elhoseiny, M. 2021. Class Normalization for (Continual)? Generalized Zero-Shot Learning. In International Conference on Learning Representations.
- Tendulkar et al. (2019) Tendulkar, P.; Krishna, K.; Selvaraju, R. R.; and Parikh, D. 2019. Trick or TReAT: Thematic Reinforcement for Artistic Typography. In ICCC.
- Van Horn et al. (2015) Van Horn, G.; Branson, S.; Farrell, R.; Haber, S.; Barry, J.; Ipeirotis, P.; Perona, P.; and Belongie, S. 2015. Building a Bird Recognition App and Large Scale Dataset With Citizen Scientists: The Fine Print in Fine-Grained Dataset Collection. In CVPR.
- Vyas, Venkateswara, and Panchanathan (2020) Vyas, M.; Venkateswara, H.; and Panchanathan, S. 2020. Leveraging seen and unseen semantic relationships for generative zero-shot learning. In European Conference on Computer Vision, 70–86. Springer.
- Wah et al. (2011) Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.
- WikiArt (2015) WikiArt, O. 2015. WikiArt Dataset. https://www.wikiart.org/. Accessed: 2020-05-30.
- Wu et al. (2021) Wu, Q.; Zhu, B.; Yong, B.; Wei, Y.; Jiang, X.; Zhou, R.; and Zhou, Q. 2021. ClothGAN: generation of fashionable Dunhuang clothes using generative adversarial networks. Connection Science, 33(2): 341–358.
- Wundt (1874) Wundt, W. M. 1874. Grundzüge der physiologischen Psychologie, volume 1. W. Engelman.
- Xian et al. (2016) Xian, Y.; Akata, Z.; Sharma, G.; Nguyen, Q.; Hein, M.; and Schiele, B. 2016. Latent embeddings for zero-shot classification. In CVPR, 69–77.
- Xian et al. (2018a) Xian, Y.; Lampert, C. H.; Schiele, B.; and Akata, Z. 2018a. Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. PAMI.
- Xian et al. (2018b) Xian, Y.; Lorenz, T.; Schiele, B.; and Akata, Z. 2018b. Feature generating networks for zero-shot learning. In CVPR.
- Zhang et al. (2017) Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; and Metaxas, D. 2017. StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks. In ICCV.
- Zhang et al. (2019) Zhang, J.; Kalantidis, Y.; Rohrbach, M.; Paluri, M.; Elgammal, A.; and Elhoseiny, M. 2019. Large-scale visual relationship understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 9185–9194.
- Zhang, Xiang, and Gong (2016) Zhang, L.; Xiang, T.; and Gong, S. 2016. Learning a Deep Embedding Model for Zero-Shot Learning. In CVPR.
- Zhang et al. (2018) Zhang, R.; Che, T.; Ghahramani, Z.; Bengio, Y.; and Song, Y. 2018. MetaGAN: An Adversarial Approach to Few-Shot Learning. In Advances in Neural Information Processing Systems, 2371–2380.
- Zhu et al. (2018) Zhu, Y.; Elhoseiny, M.; Liu, B.; Peng, X.; and Elgammal, A. 2018. A generative adversarial approach for zero-shot learning from noisy texts. In CVPR.