
Decorating Your Own Bedroom: Locally Controlling Image Generation
with Generative Adversarial Networks

Chen Zhang1  Yinghao Xu2  Yujun Shen2
1Zhejiang University  2The Chinese University of Hong Kong
[email protected]  {xy119, sy116}@ie.cuhk.edu.hk
Abstract

Generative Adversarial Networks (GANs) have achieved great success in synthesizing high-quality images. However, how to steer the generation process of a well-trained GAN model and customize the output image is much less explored. It has recently been found that modulating the input latent code of a GAN can reasonably alter some variation factors in the output image, but such manipulation usually changes the entire image as a whole. In this work, we propose an effective approach, termed LoGAN, to support local editing of the output image. Concretely, we introduce two operators, i.e., content modulation and style modulation, together with a priority mask to facilitate precise control of the intermediate generative features. Taking bedroom synthesis as an example, we are able to seamlessly remove, insert, shift, and rotate individual objects inside a room. Furthermore, our method can completely clear out a room and then refurnish it with customized furniture and styles. Experimental results show the great potential of steering the image generation of pre-trained GANs for versatile image editing.

Figure 1: Local editing achieved by LoGAN using the well-trained StyleGAN2 [17] model for bedroom synthesis. Top row presents some basic operations on individual objects, while bottom row demonstrates clearing out the room, refurnishing, and further changing style.
Figure 2: Overview of the LoGAN pipeline to re-decorate a bedroom. It consists of four steps: (i) clearing out the room based on layout prediction, (ii) selecting individual objects from the candidate pool, (iii) composing the bedroom by inserting each object to the target position at the proper layer, and (iv) rendering the recomposed scene with different styles.

1 Introduction

Recent years have witnessed significant advances of Generative Adversarial Networks (GANs) in image synthesis [15, 16, 6, 17]. However, precisely steering the generation with a pre-trained GAN model and customizing the output image remain unsolved. Recent studies focus on global editing, yet adjusting local fragments is more critical in many practical cases. In this paper, we propose an approach for local control of image generation with GANs, termed LoGAN. Different from prior work [3, 4], LoGAN performs manipulation in the feature space instead of the initial latent space. We find that the intermediate feature maps produced by GAN generators effectively encode rich spatial information, which supports editing the output within local regions. With the state-of-the-art StyleGAN2 [17] model learned for bedroom synthesis, we design content and style modulation to implement flexible object composition. In addition, we introduce a priority mask to handle the spatial relationship between different objects. In this way, we can edit single objects and further clear out and then re-decorate the whole bedroom with customized furniture and styles, as shown in Fig. 1.

1.1 Related Work

Generative Adversarial Networks. The GAN [10] was proposed to learn the mapping from a pre-defined latent space to the observed image space via adversarial training. Many GAN variants have been proposed to improve the synthesis quality and training stability [2, 11, 5, 30, 20, 15, 16, 6, 17]. However, controlling the generation process of well-trained GANs is much less explored.

Image Editing with GANs. It has been recently found that GANs spontaneously learn rich semantics when trained to synthesize images [22, 9, 13, 26]. These semantics are commonly characterized as some particular directions in the latent space. But manipulation in the latent space [22, 26] tends to change the whole image. Differently, some recent studies [23, 8, 1] propose to utilize the spatial feature maps generated by GANs for local editing. Nevertheless, they mainly focus on transferring one object to another [23, 1] or stylizing one object using another as the reference [8]. Some work [24, 19, 28, 21, 25] particularly explores image compositing which fuses some unrelated fragments together into a reasonable image. Some studies [29, 18, 14] focus on locally editing face images, like changing the nose of or adding bangs to a target face. In comparison, bedroom synthesis studied in this work is more challenging due to the high complexity of the room layout and the large diversity of the objects inside. More importantly, LoGAN supports customizing the room decoration without any retraining.

2 Framework of LoGAN

The framework is shown in Fig. 2. Precisely, with a content modulation operator $\mathtt{CMod}(\cdot,\cdot)$, a style modulation operator $\mathtt{SMod}(\cdot,\cdot)$, and an effective priority mask, LoGAN achieves local control of the image generation with GANs.

Figure 3: Diagram of parsing the bedroom layout with an image segmentation model [7]. Top row shows the segmentation map for backgrounds while the bottom row visualizes the parsed layout.

Preliminary on Style-Based Generator. The style-based generator [16, 17] adjusts layer-wise image styles using the Adaptive Instance Normalization (AdaIN) [12] operation. Accordingly, the layer-wise style control can be formulated as

$$\mathtt{A}(\mathbf{F}^{(\ell)}, \mathbf{w}^{(\ell)}) = \{\mathtt{AdaIN}(\mathbf{F}^{(\ell)}_{c},\, T^{(\ell)}(\mathbf{w}^{(\ell)}))\}_{c=1}^{C}, \qquad (1)$$

where $c$ and $C$ respectively denote the channel index and the total number of feature channels, and $T^{(\ell)}(\cdot)$ denotes the learned affine transformation that specializes the latent code into styles. In this way, given the spatial feature maps $\mathbf{F}^{(\ell)}$ of the particular $\ell$-th layer, we can adjust the overall style by altering the corresponding latent code $\mathbf{w}^{(\ell)}$.
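
To make Eq. (1) concrete, below is a minimal PyTorch sketch of the $\mathtt{A}(\cdot,\cdot)$ operator, assuming the affine transformation $T^{(\ell)}$ is a learned linear layer mapping the latent code to per-channel scale and bias; the function and class names are ours, not from the official StyleGAN2 code.

```python
import torch
import torch.nn as nn

def adain(feat, style):
    """AdaIN: normalize each channel of `feat`, then apply the per-channel
    scale and bias predicted from the latent code.
    feat:  (B, C, H, W) spatial feature maps F^(l)
    style: (B, 2*C) per-channel (scale, bias) produced by T^(l)(w^(l))
    """
    B, C, H, W = feat.shape
    scale, bias = style.view(B, 2, C, 1, 1).unbind(dim=1)
    mean = feat.mean(dim=(2, 3), keepdim=True)
    var = feat.var(dim=(2, 3), keepdim=True, unbiased=False)
    return scale * (feat - mean) / (var + 1e-8).sqrt() + bias

class StyleLayer(nn.Module):
    """A(F^(l), w^(l)): learned affine T^(l) followed by AdaIN, as in Eq. (1)."""
    def __init__(self, latent_dim, channels):
        super().__init__()
        self.affine = nn.Linear(latent_dim, 2 * channels)  # T^(l)

    def forward(self, feat, w):
        return adain(feat, self.affine(w))
```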

Layout Parsing. In order to predict the layout of the bedroom, we propose a background completion approach that resolves the occlusion problem. As shown in Fig. 3, we treat the ceiling, walls, and floor as the basic components of the background. Based on the segmentation map from the semantic segmentation model [7], we first find the convex hull of the ceiling region and obtain its key point $(x_{C}, y_{C})$. In the meantime, we find $(x_{L}, y_{L})$ and $(x_{R}, y_{R})$ on both sides of the image and form the floor boundary with an assigned slope. This automatic search produces well-shaped bedroom layouts that are very close to real cases. A rough sketch of the key-point search is given below.
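
The sketch below is only an illustration of this key-point search, not the exact implementation: we assume integer labels CEILING and FLOOR in the segmentation map, take $(x_C, y_C)$ as the lowest vertex of the ceiling's convex hull, and take $(x_L, y_L)$, $(x_R, y_R)$ as the topmost floor pixels on the left and right image borders.

```python
import numpy as np
import cv2

CEILING, FLOOR = 1, 2  # hypothetical label ids in the segmentation map

def parse_layout_keypoints(seg):
    """Rough layout key-point search from an (H, W) label map `seg`.
    Returns (x_C, y_C), (x_L, y_L), (x_R, y_R) in pixel coordinates."""
    H, W = seg.shape

    # Convex hull of the ceiling region; take its lowest vertex as the
    # key point where the two walls meet.
    ys, xs = np.where(seg == CEILING)
    pts = np.stack([xs, ys], axis=1).astype(np.int32)
    hull = cv2.convexHull(pts).reshape(-1, 2)
    x_C, y_C = hull[np.argmax(hull[:, 1])]

    # Topmost floor pixel on the left and right image borders; the floor
    # boundary is then drawn from these points with an assigned slope.
    y_L = np.where(seg[:, 0] == FLOOR)[0].min()
    y_R = np.where(seg[:, -1] == FLOOR)[0].min()
    return (int(x_C), int(y_C)), (0, int(y_L)), (W - 1, int(y_R))
```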

Content Modulation. To enable content editing, we propose a content modulation operator based on the generative features as

$$\mathtt{CMod}(\mathbf{F}^{(\ell)}, (\mathbf{F}_{o}^{(\ell)}, \mathbf{m}_{o}^{(\ell)})) = \mathbf{F}^{(\ell)} \odot (1 - \mathbf{m}_{o}^{(\ell)}) + \mathbf{F}_{o}^{(\ell)} \odot \mathbf{m}_{o}^{(\ell)}, \qquad (2)$$

where $\mathbf{m}_{o}^{(\ell)}$ and $\mathbf{F}_{o}^{(\ell)}$ represent the mask and the feature map of the editing region, and $\odot$ denotes pixel-wise multiplication broadcast along the channel dimension. With Eq. (2), we can achieve object insertion, as well as object removal by replacing $\mathbf{F}_{o}^{(\ell)}$ with the background feature $\mathbf{F}_{b}^{(\ell)}$ to refill the erased region.
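
Eq. (2) is simply a masked blend in feature space. A minimal PyTorch sketch (variable names are our own) is:

```python
import torch

def cmod(feat, feat_obj, mask_obj):
    """Content modulation (Eq. (2)): copy the object feature into the
    masked region and keep the base feature elsewhere.
    feat, feat_obj: (B, C, H, W) feature maps F^(l) and F_o^(l)
    mask_obj:       (B, 1, H, W) binary mask m_o^(l), broadcast over channels
    """
    return feat * (1 - mask_obj) + feat_obj * mask_obj
```

Object removal then amounts to calling `cmod(feat, feat_bg, mask_obj)` with a background feature `feat_bg` in place of the object feature.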

Object Clustering and Rotation. To study object generation, we first vectorize the object region at a downsampled height and width $(H_{s}, W_{s})$. Then we perform the k-means algorithm to find $M$ clusters representing different shapes of the object. Finally, we interpolate the layer-wise latent code $\mathbf{w}^{(\ell)}$ between clustering centers as

$$\mathbf{w}_{s}^{(\ell)} = \mathbf{w}_{l}^{(\ell)} + \frac{s}{S}\,(\mathbf{w}_{r}^{(\ell)} - \mathbf{w}_{l}^{(\ell)}), \quad s = 0, 1, \cdots, S, \qquad (3)$$

where $\mathbf{w}_{l}^{(\ell)}$ and $\mathbf{w}_{r}^{(\ell)}$ respectively denote the latent codes of the clustered centers for the left and right object poses. Taking $\mathbf{w}_{s}^{(\ell)}$ as the input, we manage to rotate the object inside an image.
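
The clustering and the interpolation of Eq. (3) can be sketched as follows, assuming the object masks have already been downsampled to $(H_s, W_s)$ and flattened into vectors; scikit-learn's KMeans is our choice here, as the paper does not specify the implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_object_masks(masks, num_clusters):
    """Cluster flattened (H_s * W_s) object masks into M shape/pose clusters."""
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(masks)          # masks: (N, H_s * W_s)
    return labels, km.cluster_centers_

def interpolate_codes(w_left, w_right, num_steps):
    """Eq. (3): interpolate layer-wise latent codes between two cluster
    centers, e.g., a left-facing and a right-facing bed pose."""
    return [w_left + (s / num_steps) * (w_right - w_left)
            for s in range(num_steps + 1)]
```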

Priority Mask. In real cases, object regions often overlap with each other, so we introduce a "priority" into the mask, yielding the priority mask $\mathbf{m}_{o}^{(\ell)}(p)$. We simply assign the priorities of different objects according to their common-sense decoration sequence: a larger value of $p$ indicates a higher priority in the execution order of the modulation procedure. The priority mask is computed as

$$\mathbf{m}_{o}^{(\ell)} = \mathbf{m}_{o}^{(\ell)}(p) \times \Big(1 - \sum_{o^{\prime} \neq o} \mathbf{m}_{o^{\prime}}^{(\ell)}(p^{\prime})\, \mathds{1}_{p^{\prime} > p}\Big)^{+}, \qquad (4)$$

where $\mathds{1}$ denotes the indicator function and $(\cdot)^{+}$ clips negative values to zero. Such a design turns the object masks $\{\mathbf{m}_{o}^{(\ell)}\}_{o=1}^{N}$ into disjoint regions, which ensures the customization of the local synthesis control.
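
Eq. (4) can be computed per object as in the sketch below: each mask is suppressed wherever any higher-priority object already occupies the pixel, so the resulting masks are disjoint. The data layout (dicts keyed by object name) is our own convention.

```python
import torch

def priority_masks(masks, priorities):
    """Eq. (4): resolve overlapping object masks by priority.
    masks:      dict {object_name: (1, 1, H, W) binary mask m_o(p)}
    priorities: dict {object_name: int}, larger value = higher priority
    Returns disjoint masks {object_name: m_o}.
    """
    out = {}
    for o, m in masks.items():
        # Sum of the masks of all objects with strictly higher priority.
        higher = sum((m2 for o2, m2 in masks.items()
                      if o2 != o and priorities[o2] > priorities[o]),
                     torch.zeros_like(m))
        out[o] = m * (1 - higher).clamp(min=0)   # (.)^+ clips negatives
    return out
```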

Style Modulation. When editing an object, we also need to control its style locally. To this end, we propose the style modulation operator $\mathtt{SMod}(\cdot,\cdot)$, formulated as

$$\mathtt{SMod}((\mathbf{F}^{(\ell)}, \mathbf{w}^{(\ell)}), (\mathbf{F}_{o}^{(\ell)}, \mathbf{w}_{o}^{(\ell)}, \mathbf{m}_{o}^{(\ell)})) = \mathtt{A}(\mathbf{F}^{(\ell)}, \mathbf{w}^{(\ell)}) \odot (1 - \mathbf{m}_{o}^{(\ell)}) + \mathtt{A}(\mathbf{F}_{o}^{(\ell)}, \mathbf{w}_{o}^{(\ell)}) \odot \mathbf{m}_{o}^{(\ell)}. \qquad (5)$$

Here, $\mathbf{w}^{(\ell)}$ is still used to control the global style, the same as in Eq. (1). Differently, LoGAN assigns a code $\mathbf{w}_{o}^{(\ell)}$ to each particular region $\mathbf{m}_{o}^{(\ell)}$ and hence supports controlling the style of individual objects independently.
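
In code, Eq. (5) applies the AdaIN-based operator $\mathtt{A}$ twice, once with the global code and once with the object code, and blends the two results with the priority mask. A short sketch, reusing the hypothetical `StyleLayer` from the AdaIN sketch above:

```python
def smod(style_layer, feat, w_global, feat_obj, w_obj, mask_obj):
    """Style modulation (Eq. (5)): the global code w^(l) styles everything
    outside the object mask, while the object code w_o^(l) styles the inside.
    """
    styled_bg = style_layer(feat, w_global)      # A(F^(l), w^(l))
    styled_obj = style_layer(feat_obj, w_obj)    # A(F_o^(l), w_o^(l))
    return styled_bg * (1 - mask_obj) + styled_obj * mask_obj
```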

Layer-wise Synthesis. Finally, our overall local editing system embeds $\mathtt{CMod}(\cdot,\cdot)$ and $\mathtt{SMod}(\cdot,\cdot)$ into the layer-wise generator $G(\cdot) \triangleq G^{(L)} \circ G^{(L-1)} \circ \cdots \circ G^{(1)}(\cdot)$. The layer-wise generation process is summarized in Algorithm 1.

Input: initial feature map $\mathbf{F}^{(1)}$;
        layer-wise base latent codes $\mathbf{w}^{(*)}$;
        layer-wise object lists $\text{L}[*]$;
        layer-wise object tuples $\{\mathbf{F}_{o}^{(*)}\}_{o}$, $\{\mathbf{w}_{o}^{(*)}\}_{o}$, $\{\mathbf{m}_{o}^{(*)}\}_{o}$.
Output: final synthesis $\mathbf{F}^{(L+1)}$.
for the $\ell$-th layer in the generator do
       for each object $o$ in $\text{L}[\ell]$ with $(\mathbf{F}_{o}^{(\ell)}, \mathbf{w}_{o}^{(\ell)}, \mathbf{m}_{o}^{(\ell)})$ do
             update the priority mask $\mathbf{m}_{o}^{(\ell)}$ via Eq. (4);
             edit content: $\mathbf{F}^{(\ell)} \leftarrow \mathtt{CMod}(\mathbf{F}^{(\ell)}, (\mathbf{F}_{o}^{(\ell)}, \mathbf{m}_{o}^{(\ell)}))$;
             edit style: $\mathbf{F}^{(\ell)} \leftarrow \mathtt{SMod}((\mathbf{F}^{(\ell)}, \mathbf{w}^{(\ell)}), (\mathbf{F}_{o}^{(\ell)}, \mathbf{w}_{o}^{(\ell)}, \mathbf{m}_{o}^{(\ell)}))$;
       end for
       generate the next feature map: $\mathbf{F}^{(\ell+1)} \leftarrow G^{(\ell)}(\mathbf{F}^{(\ell)})$;
end for
ALGORITHM 1: Local editing with layer-wise synthesis.
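
For completeness, Algorithm 1 can be expressed as a short Python loop over the generator blocks, reusing the `cmod`, `smod`, and `priority_masks` sketches above; the per-layer generator blocks `g_blocks`, the per-layer style layers, and the per-layer object lists are placeholders rather than the released implementation.

```python
def layer_wise_synthesis(feat, g_blocks, w_base, objects_per_layer, style_layers):
    """Sketch of Algorithm 1: apply CMod and SMod at each layer, then run
    the layer's generator block.
    feat:              initial feature map F^(1)
    g_blocks:          list of generator blocks G^(1..L)
    w_base:            list of base latent codes w^(1..L)
    objects_per_layer: list of lists of (feat_o, w_o, mask_o, priority)
    style_layers:      list of StyleLayer modules, one per layer
    """
    for l, block in enumerate(g_blocks):
        objs = objects_per_layer[l]
        # Resolve overlaps among this layer's object masks (Eq. (4)).
        masks = priority_masks(
            {i: m for i, (_, _, m, _) in enumerate(objs)},
            {i: p for i, (_, _, _, p) in enumerate(objs)})
        for i, (feat_o, w_o, _, _) in enumerate(objs):
            feat = cmod(feat, feat_o, masks[i])                       # Eq. (2)
            feat = smod(style_layers[l], feat, w_base[l],
                        feat_o, w_o, masks[i])                        # Eq. (5)
        feat = block(feat)       # F^(l+1) = G^(l)(F^(l))
    return feat
```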

3 Experiments

Here we validate the effectiveness of LoGAN on the StyleGAN2 [17] model trained on LSUN [27] bedrooms.

Figure 4: Layer-wise analysis on object removal. We remove the objects highlighted by the orange boxes from the source images in the first column. The remaining columns show the results of removing a particular object at different layers.

3.1 Single Object Removal

Fig. 4 shows the results of removing a bed, a window, and a picture from the source image. For each object category, we perform the removal operation at the 4-th, 6-th, 8-th, and 10-th layers respectively. We observe that manipulation at the 4-th layer better blends the background feature with the object feature, leading to more satisfying results. That is because the convolutional kernels at early layers have larger receptive fields. By contrast, editing at later layers makes the removed region much sharper, especially at the mask boundaries. As a result, we recommend removing objects at the 4-th layer.

3.2 Creating an Empty Bedroom

As illustrated in Fig. 5, LoGAN creates empty bedrooms from the source images. Notice that the layout of the empty room satisfyingly matches the source room and that the texture of the background is also well preserved from the source image, demonstrating the effectiveness of the proposed layout parsing and feature manipulation algorithms.

3.3 Clustering and Rotating Beds

As shown in Fig. 6, clustering on the downsampled object masks easily discriminates the shape and orientation of the bed. Besides, we manage to rotate the bed in the source image to other poses while preserving its original color and texture. We conclude that interpolating the latent codes between clustering centers controls the object pose and thus realizes the rotation.

Figure 5: Empty bedrooms created by LoGAN, where we remove all objects inside the bedroom based on the parsed layout.
Figure 6: Object rotation with the help of clustering. The top-left sample indicates the source image. Top row represents the synthesis corresponding to four clustering centers. Bottom row shows the rotated beds from the source image in target poses.
Figure 7: Layer-wise analysis on object insertion. We select a bed, a window, and a picture from a source image (highlighted with orange boxes) in the first column and further insert them into the empty room at different layers.

3.4 Inserting a New Object

Fig. 7 presents the results of inserting the target object at the 4-th, 7-th, 10-th, and 13-th layers of the generator. The picture and window manipulations from the 7-th to the 13-th layer give similar results, all of which well preserve the shape and texture of the target object. In comparison, the editing results at the 4-th layer for all three object categories are eroded at the region boundary, because the inserted feature is blended with the background feature due to the larger receptive fields of earlier convolutional kernels. Consequently, we recommend inserting these objects at the 7-th layer for better results.

Figure 8: Bedroom compositing by progressively inserting new furniture, including windows, pictures, tables and lamps. LoGAN is flexible and robust to support customizing the composition.
Figure 9: Re-stylization by applying new styles to the sample in the first column.

3.5 Bedroom Compositing

We conduct comprehensive experiments on decorating the bedroom with multiple objects. The first column of Fig. 8 acts as the base image of each row for further editing. We first compose a bed, a window, and a picture in the generated empty bedroom, then decorate the room with diverse tables and lamps. LoGAN reliably organizes the same or different objects at designated places inside the room. The results demonstrate the flexibility and robustness of LoGAN in customizing the bedroom composition.

3.6 Changing Style

According to the formulation of StyleGAN [16], the sampled latent codes can be easily used to alter the overall image style. In particular, we can sample numerous latent codes from the latent space and use them to guide the generation process. Fig. 9 visualizes some re-stylization results of the decorated room with different color schemes.

4 Conclusion

In this work, we present LoGAN to locally control image generation with GANs by manipulating the intermediate feature maps. With the novel priority mask and two carefully designed modulation operators, we utilize the state-of-the-art StyleGAN2 model for photo-realistic local image editing. This work sheds light on using well-learned GAN models to facilitate various image manipulation tasks. We believe more work will be built on LoGAN to advance a wide range of computer graphics applications.

References

  • [1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan++: How to edit the embedded images? In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
  • [2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Int. Conf. Mach. Learn., 2017.
  • [3] David Bau, Alex Andonian, Audrey Cui, YeonHwan Park, Ali Jahanian, Aude Oliva, and Antonio Torralba. Paint by word. arXiv preprint arXiv:2103.10951, 2021.
  • [4] David Bau, Hendrik Strobelt, William Peebles, Jonas Wulff, Bolei Zhou, Jun-Yan Zhu, and Antonio Torralba. Semantic photo manipulation with a generative image prior. arXiv preprint arXiv:2005.07727, 2020.
  • [5] David Berthelot, Thomas Schumm, and Luke Metz. Began: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
  • [6] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In Int. Conf. Learn. Represent., 2019.
  • [7] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Eur. Conf. Comput. Vis., 2018.
  • [8] Edo Collins, Raja Bala, Bob Price, and Sabine Susstrunk. Editing in style: Uncovering the local semantics of gans. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
  • [9] Lore Goetschalckx, Alex Andonian, Aude Oliva, and Phillip Isola. Ganalyze: Toward visual definitions of cognitive image properties. In Int. Conf. Comput. Vis., 2019.
  • [10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Adv. Neural Inform. Process. Syst., 2014.
  • [11] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Adv. Neural Inform. Process. Syst., 2017.
  • [12] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Int. Conf. Comput. Vis., 2017.
  • [13] Ali Jahanian, Lucy Chai, and Phillip Isola. On the ”steerability” of generative adversarial networks. In Int. Conf. Learn. Represent., 2020.
  • [14] Youngjoo Jo and Jongyoul Park. Sc-fegan: Face editing generative adversarial network with user’s sketch and color. In Int. Conf. Comput. Vis., 2019.
  • [15] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In Int. Conf. Learn. Represent., 2018.
  • [16] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., 2019.
  • [17] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
  • [18] Cheng-Han Lee, Ziwei Liu, Lingyun Wu, and Ping Luo. Maskgan: Towards diverse and interactive facial image manipulation. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
  • [19] Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman, and Simon Lucey. St-gan: Spatial transformer generative adversarial networks for image compositing. In IEEE Conf. Comput. Vis. Pattern Recog., 2018.
  • [20] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In Int. Conf. Learn. Represent., 2018.
  • [21] Thu Nguyen-Phuoc, Christian Richardt, Long Mai, Yong-Liang Yang, and Niloy Mitra. Blockgan: Learning 3d object-aware scene representations from unlabelled images. In Adv. Neural Inform. Process. Syst., 2020.
  • [22] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of gans for semantic face editing. In IEEE Conf. Comput. Vis. Pattern Recog., 2020.
  • [23] Ryohei Suzuki, Masanori Koyama, Takeru Miyato, Taizan Yonetsuji, and Huachun Zhu. Spatially controllable image synthesis with internal representation collaging. arXiv preprint arXiv:1811.10153, 2018.
  • [24] Yi-Hsuan Tsai, Xiaohui Shen, Zhe Lin, Kalyan Sunkavalli, Xin Lu, and Ming-Hsuan Yang. Deep image harmonization. In IEEE Conf. Comput. Vis. Pattern Recog., 2017.
  • [25] Hung-Yu Tseng, Hsin-Ying Lee, Lu Jiang, Ming-Hsuan Yang, and Weilong Yang. Retrievegan: Image synthesis via differentiable patch retrieval. In Eur. Conf. Comput. Vis., 2020.
  • [26] Ceyuan Yang, Yujun Shen, and Bolei Zhou. Semantic hierarchy emerges in deep generative representations for scene synthesis. arXiv preprint arXiv:1911.09267, 2019.
  • [27] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
  • [28] Fangneng Zhan, Hongyuan Zhu, and Shijian Lu. Spatial fusion gan for image synthesis. In IEEE Conf. Comput. Vis. Pattern Recog., 2019.
  • [29] Gang Zhang, Meina Kan, Shiguang Shan, and Xilin Chen. Generative adversarial network with spatial attention for face attribute editing. In Eur. Conf. Comput. Vis., 2018.
  • [30] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In Int. Conf. Mach. Learn., 2019.