
Hierarchical Diffusion Autoencoders and Disentangled Image Manipulation


We thank the reviewers for their comments. We refer to reviewer HzYG as R1, wGwL as R2, and wjos as R3. We are encouraged that they found our method well-motivated (R1) and effective (R2), with impressive results (R1, R2, R3), that they consider it important for the research community (R2), and that they found the paper well-written (R2, R3). We will revise the paper according to the suggestions.

R1 Q1 Results on other datasets. More reconstruction and generation results on LSUN bedrooms and horses are shown in Fig.1(a,b). We will add more results in the final version.

R1 Q2 Higher resolution images. We follow DAE [29] (CVPR 2022) and experiment on 256×256 images; our approach can be easily generalized to generate 512×512 images by cascading a super-resolution diffusion model.
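For concreteness, a minimal sketch of the cascading idea; the handles hdae and sr_diffusion and their signatures are our own illustration, not released code:

```python
# Hedged sketch: sample at 256x256 with the (H)DAE decoder, then upsample to
# 512x512 with a separate super-resolution diffusion model conditioned on the
# low-resolution sample. All names here are illustrative assumptions.
def generate_512(hdae, sr_diffusion, z_sem, steps=50):
    x_256 = hdae.decode(z_sem, steps=steps)                # base 256x256 sample
    x_512 = sr_diffusion.sample(cond=x_256, steps=steps)   # SR stage to 512x512
    return x_512
```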

R1 Q3 Comparison with other inversion methods. A visual comparison is shown in Fig.1(a). Our reconstruction MSE is 5.01e-5, compared with 0.014 for PTI (Roich et al., Pivotal Tuning for Latent-based Editing of Real Images, ACM TOG 2022) and 0.05 for E4E [7] (Tov et al., Designing an Encoder for StyleGAN Image Manipulation, ACM TOG 2021).
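As a reference for how such numbers are computed, a minimal sketch of per-pixel reconstruction MSE, assuming images are float tensors normalized to [0, 1] (the exact preprocessing may differ):

```python
import torch

def reconstruction_mse(x: torch.Tensor, x_rec: torch.Tensor) -> float:
    """Mean squared error averaged over all pixels and channels."""
    return torch.mean((x - x_rec) ** 2).item()
```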

R1 Q4 Harder editing. We show more editing examples, such as gender, beard, and eyeglasses, in Fig.8 of the supplementary material. Our method works on the 40 attributes annotated in CelebA-HQ and can be generalized to other attributes as long as we can train a classifier for them. We did not experiment on head poses due to the lack of such annotations.

R1 Additional Comments. As shown in Fig.6 of the paper, HDAE(U+) performs slightly better on MSE but slightly worse on LPIPS. Since HDAE(U+) has 30M more parameters than HDAE(U) without much performance gain, HDAE(U) is the optimal model. We will clarify this in the final version.

R2 Q1 Novelty. HDAE is not a naive extension of DAE: the hierarchical design is non-trivial, and we experimented with multiple designs to find the best one (Fig.2 of the main paper). The coarse-to-fine feature hierarchy enables applications beyond DAE, such as style mixing, controlled interpolation, and multimodal semantic image synthesis. The truncated-feature approach further enables disentangled editing.

R2 Q2 Justification of more efficient training. We extend the bottleneck of DAE to 2560, the same size as the bottleneck of HDAE. At 1000 training steps, the validation set reconstruction MSE of DAE(2560) and HDAE are 4.40e-3 and 2.84e-3, respectively. This fair comparison shows that the hierarchical structure does improve the efficiency and effectiveness of the model.
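To illustrate the hierarchical design behind this comparison, a minimal sketch of a hierarchical semantic encoder whose per-level codes concatenate to the same 2560 dimensions as the enlarged DAE bottleneck; the module names, level count, and widths below are illustrative assumptions, not the paper's exact architecture:

```python
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Pools U-Net feature maps at several resolutions into per-level codes."""
    def __init__(self, feat_channels=(64, 128, 256, 512), code_dim=640):
        super().__init__()
        # one projection head per feature level (coarse-to-fine hierarchy)
        self.heads = nn.ModuleList(nn.Linear(c, code_dim) for c in feat_channels)

    def forward(self, feats):
        # feats: list of per-level feature maps, each of shape [B, C_l, H_l, W_l]
        codes = []
        for feat, head in zip(feats, self.heads):
            pooled = feat.mean(dim=(2, 3))   # global average pooling
            codes.append(head(pooled))       # per-level semantic code [B, code_dim]
        return codes                         # 4 levels x 640 dims = 2560 total
```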

R2 Q3 Richer semantics and disentangled representations. We agree with the reviewer that HDAE disentangles semantics at different scales, and we will clarify this in the final version. To demonstrate that our latent space has richer semantics, we evaluate the predictive power of our latent codes by linear probing on the 40 face attributes of CelebA-HQ. Tab.1 indicates that HDAE has richer semantics and that its features are disentangled at different levels.

| Method | Smile | Eyeglass | Oval_Face | Young | Blackhair | PaleSkin | Brownhair | 5_o_Clock_Shadow | Macro avg. (all 40) |
|---|---|---|---|---|---|---|---|---|---|
| HDAE low-level code | 0.8874 | 0.894 | 0.6397 | 0.8537 | 0.8867 | 0.9107 | 0.8196 | 0.8938 | 0.8418 |
| HDAE high-level code | 0.9705 | 0.9552 | 0.668 | 0.8659 | 0.789 | 0.7809 | 0.673 | 0.8535 | 0.8452 |
| HDAE whole code | 0.9711 | 0.9691 | 0.665 | 0.8831 | 0.8887 | 0.928 | 0.8234 | 0.9067 | 0.8823 |
| DAE(2560) | 0.9651 | 0.9663 | 0.6456 | 0.8771 | 0.8797 | 0.9202 | 0.7852 | 0.8959 | 0.8673 |

Table 1: (R2 Q3) Linear probing on the 40 attributes of CelebA-HQ. Smile, Eyeglass, Oval_Face, and Young relate to high-level semantics; Blackhair, PaleSkin, Brownhair, and 5_o_Clock_Shadow relate to low-level semantics.
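For clarity, a minimal sketch of the probing protocol assumed here: one logistic-regression probe per attribute fitted on frozen latent codes, reporting per-attribute accuracy and the macro average (the exact solver and data split may differ from our actual setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe(z_train, y_train, z_test, y_test):
    """z_*: [N, D] frozen latent codes; y_*: [N, 40] binary attribute labels."""
    accs = []
    for a in range(y_train.shape[1]):            # one probe per attribute
        clf = LogisticRegression(max_iter=1000)
        clf.fit(z_train, y_train[:, a])
        accs.append(clf.score(z_test, y_test[:, a]))
    return np.array(accs), float(np.mean(accs))  # per-attribute acc, macro avg.
```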
Figure 1: More results will be added in the final version.

R2 Q4 Unconditional generation. Fig.1(b) shows unconditional generation on FFHQ, LSUN bedrooms and horses.

R2 Additional comments. Thank you for your suggestions. We will revise accordingly and polish the paper.

R3 Q1 Q2 Disentangled representations. [a] (Locatello et al., Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations, ICML 2019) shows that unsupervised learning of strictly disentangled representations is impossible without inductive biases. Rather than perfect disentanglement, our claim is that HDAE is better than DAE at disentangling features at different scales, i.e., low-level and high-level features are encoded at different levels of the hierarchical latent space. This scale-wise disentanglement is made possible by the inductive bias of the U-Net architecture. The CIFAR10 plot (Fig.1(d)) shows a similar pattern: most high activations come from a single feature level. We will add plots for more attributes in the Appendix. Since HDAE only disentangles features at different scales, we need the truncated-feature approach to further disentangle features of different semantic attributes for image editing. We will add these discussions in the final version.
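As a rough illustration of the truncated-feature idea, a minimal sketch that edits only the latent levels relevant to a target attribute while freezing the rest; the level selection and the source of the edit directions (e.g., normals of linear attribute classifiers) are assumptions for illustration, not the exact procedure in the paper:

```python
def truncated_edit(codes, directions, levels_to_edit, strength=1.0):
    """codes: list of per-level latent codes (tensors); directions: matching
    list of unit edit directions, e.g., from linear attribute classifiers."""
    edited = []
    for i, z in enumerate(codes):
        if i in levels_to_edit:
            edited.append(z + strength * directions[i])  # shift selected levels only
        else:
            edited.append(z)                             # keep other levels intact
    return edited
```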

R3 Q3 DAE with truncated features. We add DAE with truncated features as an additional baseline in Fig.1(c).

R3 Q4 CIFAR10 experiments. We train HDAE on CIFAR10 and evaluate it by linear classification on the learned features. The HDAE and DAE features achieve 66.34% and 61.85% average classification accuracy, respectively.

R3 Q5 User study details. We provide the original image and the manipulated/reconstructed image, and ask the users to choose which one is better or to mark them as similar. We adopt the face attributes from the CelebA-HQ annotations for the image manipulation evaluation. We will add details in the final version.

R3 Q6 Number of parameters. DAE has 123M parameters and HDAE(U) has 189M. For a fair comparison, we also add a DAE(2560) baseline with 190M parameters, which performs worse than HDAE(U) (see R2 Q2 for details).