
Plug-In Inversion: Model-Agnostic Inversion for Vision with Data Augmentations

Amin Ghiasi    Hamid Kazemi    Steven Reich    Chen Zhu    Micah Goldblum    Tom Goldstein
Abstract

Existing techniques for model inversion typically rely on hard-to-tune regularizers, such as total variation or feature regularization, which must be individually calibrated for each network in order to produce adequate images. In this work, we introduce Plug-In Inversion, which relies on a simple set of augmentations and does not require excessive hyper-parameter tuning. Under our proposed augmentation-based scheme, the same set of augmentation hyper-parameters can be used for inverting a wide range of image classification models, regardless of input dimensions or the architecture. We illustrate the practicality of our approach by inverting Vision Transformers (ViTs) and Multi-Layer Perceptrons (MLPs) trained on the ImageNet dataset, tasks which to the best of our knowledge have not been successfully accomplished by any previous works.


1 Introduction

Model inversion is an important tool for visualizing and interpreting behaviors inside neural architectures, understanding what models have learned, and explaining model behaviors. In general, model inversion seeks inputs that either activate a feature in the network (feature visualization) or yield a high output response for a particular class (class inversion) (Olah et al., 2017). Model inversion and visualization have been a cornerstone of conceptual studies that reveal how networks decompose images into semantic information (Zeiler & Fergus, 2014; Dosovitskiy & Brox, 2016). Over time, inversion methods have shifted from solving conceptual problems to solving practical ones. Saliency maps, for example, are image-specific model visualizations that reveal the inputs that most strongly influence a model’s decisions (Simonyan et al., 2014).

Recent advances in network architecture pose major challenges for existing model inversion schemes. Convolutional Neural Networks (CNNs) have long been the de-facto approach for computer vision tasks, and they are the focus of nearly all research in the model inversion field. Recently, other architectures have emerged that achieve results competitive with CNNs. These include Vision Transformers (ViTs; Dosovitskiy et al., 2021), which are based on self-attention layers, and MLP-Mixer (Tolstikhin et al., 2021) and ResMLP (Touvron et al., 2021a), which are based on multi-layer perceptron layers. Unfortunately, most existing model inversion methods either cannot be applied to these architectures, or are known to fail. For example, the feature regularizer used in DeepInversion (Yin et al., 2020) cannot be applied to ViTs or MLP-based models because they do not include Batch Normalization layers (Ioffe & Szegedy, 2015).

In this work, we focus on class inversion, the goal of which is to find interpretable images that maximize the score a classification model assigns to a chosen label without knowledge about the model’s training data. Class inversion has been used for a variety of tasks including model interpretation (Mordvintsev et al., 2015), image synthesis (Santurkar et al., 2019), and data-free knowledge transfer (Yin et al., 2020). However, current inversion methods have several key drawbacks. The quality of generated images is often highly sensitive to the weights assigned to regularization terms, so these hyper-parameters need to be carefully calibrated for each individual network. In addition, methods requiring batch norm parameters are not applicable to emerging architectures.

To overcome these limitations, we present Plug-In Inversion (PII), an augmentation-based approach to class inversion. PII does not require any explicit regularization, which eliminates the need to tune regularizer-specific hyper-parameters for each model or image instance. We show that PII is able to invert CNNs, ViTs, and MLP networks using the same architecture-agnostic method, and with the same architecture-agnostic hyper-parameters.

We summarize our contributions as follows:

  • We provide a detailed analysis of various augmentations and how they affect the quality of images produced via class inversion.

  • We introduce Plug-In Inversion (PII), a new class inversion technique based on these augmentations, and compare it to existing techniques.

  • We apply PII to dozens of different pre-trained models of varying architecture, justifying the claim that it can be ‘plugged in’ to most networks without modification.

  • In particular, we show that PII succeeds in inverting ViTs and large MLP-based architectures, which to our knowledge has not previously been accomplished.

  • Finally, we explore the potential for combining PII with prior methods.

2 Background

2.1 Class inversion

In the basic procedure for class inversion, we begin with a pre-trained model $f$ and chosen target class $y$. We randomly initialize (and optionally pre-process) an image $\mathbf{x}$ in the input space of $f$. We then perform gradient descent to solve the optimization problem $\hat{x}=\operatorname*{arg\,min}_{\mathbf{x}}\mathcal{L}(f(\mathbf{x}),y)$ for a chosen objective function $\mathcal{L}$ to produce a class image $\hat{x}$. For very shallow networks and small datasets, letting $\mathcal{L}$ be cross-entropy or even the negative confidence assigned to the true class can produce recognizable images with minimal pre-processing (Fredrikson et al., 2015). Modern deep neural networks, however, cannot be inverted as easily.
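As a concrete illustration, this basic procedure amounts to gradient descent on the input pixels. A minimal PyTorch sketch is given below; the model, target class, input shape, and step count are placeholders, and $\mathcal{L}$ is taken to be cross-entropy.

```python
import torch
import torch.nn.functional as F

def basic_class_inversion(model, target_class, input_shape=(1, 3, 224, 224),
                          steps=1000, lr=0.01):
    """Minimal class inversion: gradient descent on a randomly initialized
    image to maximize the score of the target class."""
    model.eval()
    x = torch.randn(input_shape, requires_grad=True)   # random initialization
    y = torch.tensor([target_class])
    optimizer = torch.optim.Adam([x], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), y)            # L(f(x), y)
        loss.backward()
        optimizer.step()

    return x.detach()
```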

2.2 Regularization

Most prior work on class inversion for deep networks has focused on carefully designing the objective function to produce quality images. This entails combining a divergence term (e.g., cross-entropy) with one or more regularization terms (image priors) meant to guide the optimization towards an image with ‘natural’ characteristics. DeepDream (Mordvintsev et al., 2015), following work on feature inversion (Mahendran & Vedaldi, 2015), uses two such terms: $\mathcal{R}_{\ell_2}(\mathbf{x})=\|\mathbf{x}\|_2^2$, which penalizes the magnitude of the image vector, and total variation, defined as $\mathcal{R}_{TV}(\mathbf{x})=\sum_{\Delta_i,\Delta_j\in\{0,1\}}\big(\sum_{i,j}(x_{i+\Delta_i,j+\Delta_j}-x_{i,j})^2\big)^{\frac{1}{2}}$, which penalizes sharp changes over small distances.
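For reference, both DeepDream priors can be written in a few lines of PyTorch. The sketch below assumes a single image tensor of shape (1, C, H, W) and follows the definitions above:

```python
import torch

def l2_prior(x):
    # R_l2(x) = ||x||_2^2
    return x.pow(2).sum()

def tv_prior(x):
    # R_TV(x): for each offset (Delta_i, Delta_j) in {0,1}^2, take the square
    # root of the summed squared differences between shifted and original pixels.
    dh = x[:, :, 1:, :] - x[:, :, :-1, :]        # (1, 0) offset
    dw = x[:, :, :, 1:] - x[:, :, :, :-1]        # (0, 1) offset
    dd = x[:, :, 1:, 1:] - x[:, :, :-1, :-1]     # (1, 1) offset
    return (dh.pow(2).sum().sqrt()
            + dw.pow(2).sum().sqrt()
            + dd.pow(2).sum().sqrt())
```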

DeepInversion (Yin et al., 2020) uses both of these regularizers, along with the feature regularizer $\mathcal{R}_{feat}(\mathbf{x})=\sum_k\big(\|\mu_k(\mathbf{x})-\hat{\mu}_k\|_2+\|\sigma_k^2(\mathbf{x})-\hat{\sigma}_k^2\|_2\big)$, where $\mu_k,\sigma_k^2$ are the batch mean and variance of the features output by the $k$-th convolutional layer, and $\hat{\mu}_k,\hat{\sigma}_k^2$ are the corresponding Batch Normalization statistics stored in the model (Ioffe & Szegedy, 2015). Naturally, this method is only applicable to models that use Batch Normalization, which leaves out ViTs, MLPs, and even some CNNs. Furthermore, the optimal weights for each regularizer in the objective function vary wildly depending on architecture and training set, which presents a barrier to easily applying such methods to a wide array of networks.
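One way to compute such a feature regularizer is with forward hooks on every Batch Normalization layer. The following is a rough sketch under the assumption of a CNN built from nn.BatchNorm2d modules with stored running statistics, not the exact DeepInversion implementation:

```python
import torch
import torch.nn as nn

def feature_regularizer(model, x):
    """Sketch of R_feat: match the batch statistics of features entering each
    BatchNorm layer to that layer's stored running statistics."""
    losses, handles = [], []

    def make_hook(bn):
        def hook(module, inputs, output):
            feats = inputs[0]                                   # features entering BN
            mu = feats.mean(dim=(0, 2, 3))                      # per-channel batch mean
            var = feats.var(dim=(0, 2, 3), unbiased=False)      # per-channel batch variance
            losses.append((mu - bn.running_mean).norm(2)
                          + (var - bn.running_var).norm(2))
        return hook

    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            handles.append(module.register_forward_hook(make_hook(module)))

    model(x)             # forward pass populates `losses` via the hooks
    for h in handles:
        h.remove()
    return sum(losses)
```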

Figure 1: An image at different stages of optimization with centering (left), and an image inverted without centering (right), for the Border Terrier class of a robust ResNet-50.

Figure 2: An image during different stages of optimization with zoom (left), and an image inverted without zoom (right), for the Jay class of a robust ResNet-50.

2.3 Architectures for vision

We now present a brief overview of the three basic types of vision architectures that we will consider.

Convolutional Neural Networks (CNNs) have long been the standard in deep learning for computer vision (LeCun et al., 1989; Krizhevsky et al., 2012). Convolutional layers encourage a model to learn properties desirable for vision tasks, such as translation invariance. Numerous CNN models exist, mainly differing in the number, size, and arrangement of convolutional blocks and whether they include residual connections, Batch Normalization, or other modifications (He et al., 2016; Zagoruyko & Komodakis, 2016; Simonyan & Zisserman, 2014).

Dosovitskiy et al. (2021) recently introduced Vision Transformers (ViTs), adapting the Transformer architectures commonly used in NLP (Vaswani et al., 2017). ViTs break input images into patches, combine them with positional embeddings, and use these as input tokens to self-attention modules. Some proposed variants require less training data (Touvron et al., 2021c), have convolutional inductive biases (d’Ascoli et al., 2021), or make other modifications to the attention modules (Chu et al., 2021; Liu et al., 2021b; Xu et al., 2021).

Subsequently, a number of authors have proposed vision models which are based solely on Multi-Layer Perceptrons (MLPs), using insights from ViTs (Tolstikhin et al., 2021; Touvron et al., 2021a; Liu et al., 2021a). Generally, these models use patch embeddings similar to ViTs and alternate channel-wise and patch-wise linear embeddings, along with non-linearities and normalization.

We emphasize that as the latter two architecture types are recent developments, our work is the first to study them in the context of model inversion.

3 Plug-In Inversion

Prior work on class inversion uses augmentations like jitter, which randomly shifts an image horizontally and vertically, and horizontal flips to improve the quality of inverted images (Mordvintsev et al., 2015; Yin et al., 2020). The hypothesis behind their use is that different views of the same image should result in similar scores for the target class. These augmentations are applied to the input before feeding it to the network, with a fresh augmentation drawn at each gradient step used to reconstruct $\mathbf{x}$. In this section, we explore additional augmentations that benefit inversion before describing how we combine them to form the PII algorithm.
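Jitter is commonly implemented as a random circular shift of the input tensor; a minimal sketch follows (the maximum shift of 32 pixels is an illustrative choice, not a value prescribed by prior work):

```python
import random
import torch

def jitter(x, max_shift=32):
    """Jitter: randomly shift the image horizontally and vertically
    (implemented here as a circular shift of the pixel grid)."""
    dx = random.randint(-max_shift, max_shift)
    dy = random.randint(-max_shift, max_shift)
    return torch.roll(x, shifts=(dy, dx), dims=(2, 3))
```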

Figure 3: Inversions of the robust ResNet-50 ATM class, with and without ColorShift and with varying TV regularization strength ($\log(\lambda_{TV})$ from $-9$ to $-4$). The inversion process with ColorShift is robust to changes in the $\lambda_{TV}$ hyper-parameter, while without it, $\lambda_{TV}$ seems to present a trade-off between noise and blur.

Figure 4: Effect of ensemble size ($e=1$ through $e=64$) on the quality of inverted images for the Tugboat class of a robust ResNet-50.

As robust models are typically easier to invert than naturally trained models (Santurkar et al., 2019; Mejia et al., 2019), we use a robust ResNet-50 (He et al., 2016) model trained on the ImageNet (Deng et al., 2009) dataset throughout this section as a toy example to examine how different augmentations impact inversion. Note that we perform the demonstrations in this section under slightly different conditions and with different models than those ultimately used for PII, in order to highlight the effects of the augmentations as clearly as possible. Thorough experimental details can be found in Appendix C.

3.1 Restricting Search Space

In this section, we consider two augmentations to improve the spatial qualities of inverted images: Centering and Zoom. These are designed based on our hypothesis that restricting the input optimization space encourages better placement of recognizable features. Both methods start with small input patches, and each gradually increases this space in different ways to reach the intended input size. In doing so, they force the inversion algorithm to place important semantic content in the center of the image.

Centering

Let $\mathbf{x}$ be the input image being optimized. At first, we only optimize a patch at the center of $\mathbf{x}$. After a fixed number of iterations, we increase the patch size outward by padding with random noise, repeating this until the patch reaches the full input size. Figure 1 shows the state of the image prior at each stage of this process, as well as an image produced without centering. Without centering, the network's shift invariance allows most semantic content to scatter to the image edges. With centering, results remain coherent.
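A minimal sketch of one centering expansion step might look as follows (the tensor shapes and the use of Gaussian noise for padding are illustrative assumptions); after each expansion, optimization continues on the enlarged image:

```python
import torch

def expand_with_noise(patch, new_size):
    """Centering step: place the current optimized patch at the center of a
    larger canvas and pad the surrounding border with fresh random noise."""
    n, c, h, w = patch.shape
    canvas = torch.randn(n, c, new_size, new_size, device=patch.device)
    top, left = (new_size - h) // 2, (new_size - w) // 2
    canvas[:, :, top:top + h, left:left + w] = patch.detach()
    # The enlarged image becomes the new leaf variable to optimize.
    return canvas.requires_grad_(True)
```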

Zoom

For zoom, we begin with an image $\mathbf{x}$ of lower resolution than the desired result. In each step, we optimize this image for a fixed number of iterations and then up-sample the result, repeating until we reach the full resolution. Figure 2 shows the state of an image at each step of the zoom procedure, along with an image produced without zoom. The latter image splits the object of interest at its edges. By contrast, zoom appears to find a meaningful structure for the image in the early steps and refines details like texture as the resolution increases.
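A corresponding sketch of one zoom step (bilinear up-sampling is an assumption; any standard interpolation would serve the same purpose):

```python
import torch.nn.functional as F

def zoom_step(x, new_size):
    """Zoom step: up-sample the current low-resolution image to the next
    resolution before continuing optimization at that scale."""
    x_up = F.interpolate(x.detach(), size=(new_size, new_size),
                         mode='bilinear', align_corners=False)
    return x_up.requires_grad_(True)
```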

We note that zoom is not an entirely novel idea in inversion. Yin et al. (2020) use a similar technique as ‘warm-up’ for better performance and speed-up. However, we observe that continuing zoom throughout optimization contributes to the overall success of PII.

Figure 5: The effect of various combinations of zoom (Z), centering (C), both (Z + C), or neither (None), with and without ColorShift, when inverting the Dipper class using a naturally-trained ResNet-50.

Zoom + Centering

Unsurprisingly, we have found that applying zoom and centering simultaneously yields even better results than applying either individually, since each one provides a different benefit. Centering places detailed and important features (e.g. the dog’s eye in Figure 1) near the center and builds the rest of the image around the existing patch. Zoom helps enforce a sound large-scale structure for the image and fills in details later.

The combined Zoom and Centering process proceeds in ‘stages’, each at a higher resolution than the last. Each stage begins with an image patch generated by the previous stage, which approximately minimizes the inversion loss. The patch is then up-sampled to a resolution halfway between the previous and current stage resolutions, filling the center of the image and leaving a border that is padded with random noise. The next round of optimization then begins from this newly processed image.
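A sketch of this stage transition, using the resolution schedule from Algorithm 1 in Appendix E (stage $s$ up-samples to $(2s+1)R/16$ and pads to $(s+1)R/8$), is given below; it simply combines the zoom and centering steps sketched earlier:

```python
import torch
import torch.nn.functional as F

def zoom_and_center(x, stage, final_res):
    """One Zoom + Centering stage transition: up-sample the previous patch to
    a resolution halfway between stages, then pad it with random noise up to
    the current stage resolution."""
    mid_res = (2 * stage + 1) * final_res // 16   # halfway resolution
    cur_res = (stage + 1) * final_res // 8        # current stage resolution
    x_up = F.interpolate(x.detach(), size=(mid_res, mid_res),
                         mode='bilinear', align_corners=False)
    canvas = torch.randn(x.shape[0], x.shape[1], cur_res, cur_res, device=x.device)
    offset = (cur_res - mid_res) // 2
    canvas[:, :, offset:offset + mid_res, offset:offset + mid_res] = x_up
    return canvas.requires_grad_(True)
```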

3.2 ColorShift Augmentation

The colors of the illustrative images we have shown so far are notably different from what one might expect in a natural image. This is due to ColorShift, a new augmentation that we now present.

ColorShift is an adjustment of an image’s colors by a random mean and variance in each channel. This can be formulated as follows:

$\text{ColorShift}(\mathbf{x})=\sigma\mathbf{x}-\mu,$

where $\mu$ and $\sigma$ are $C$-dimensional vectors (with $C$ the number of channels) drawn from $\mathcal{U}(-\alpha,\alpha)$ and $e^{\mathcal{U}(-\beta,\beta)}$, respectively, and are repeatedly redrawn after a fixed number of iterations. We use $\alpha=\beta=1.0$ in all demonstrations unless otherwise noted. At first glance, this deliberate shift away from the distribution of natural images seems counterproductive to the goal of producing a recognizable image. However, our results show that using ColorShift noticeably increases the amount of visual information in inverted images and also obviates the need for hard-to-tune regularizers to stabilize optimization.
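A direct PyTorch sketch of this augmentation, with the per-channel constants drawn as described above:

```python
import torch

def color_shift(x, alpha=1.0, beta=1.0):
    """ColorShift(x) = sigma * x - mu, with per-channel constants
    mu ~ U(-alpha, alpha) and sigma = exp(U(-beta, beta))."""
    c = x.shape[1]
    mu = torch.zeros(1, c, 1, 1, device=x.device).uniform_(-alpha, alpha)
    sigma = torch.zeros(1, c, 1, 1, device=x.device).uniform_(-beta, beta).exp()
    return sigma * x - mu
```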

We visualize the stabilizing effect of ColorShift in Figure 3. In this experiment, we invert the model by minimizing the sum of a cross-entropy loss and a total-variation (TV) penalty. Without ColorShift, the quality of images is highly dependent on the weight $\lambda_{TV}$ of the TV regularizer; smaller values produce noisy images, while larger values produce blurry ones. Inversion with ColorShift, on the other hand, is insensitive to this value and in fact succeeds when omitting the regularizer altogether.

Other preliminary experiments show that ColorShift similarly removes the need for $\ell_2$ or feature regularization, as our main results for PII will show. We conjecture that by forcing unnatural colors into an image, ColorShift requires the optimization to find a solution which contains meaningful semantic information, rather than photo-realistic colors, in order to achieve a high class score. Alternatively, as seen in Figure 10, images optimized with an image prior may achieve high scores despite a lack of semantic information merely by finding sufficiently natural colors and textures.

3.3 Ensembling

Ensembling is an established tool often used in applications from enhanced inference (Opitz & Maclin, 1999) to dataset security (Souri et al., 2021). We find that simultaneously optimizing an ensemble composed of different ColorShifts of the same image improves the performance of inversion methods. To this end, we minimize the average of the cross-entropy losses $\mathcal{L}(f(\mathbf{x}_i),y)$, where the $\mathbf{x}_i$ are different ColorShifts of the image at the current step of optimization. Figure 4 shows the result of applying ensembling alongside ColorShift. We observe that larger ensembles appear to give slight improvements, but even ensembles of size one or two produce satisfactory results. This is important for models like ViTs, where available GPU memory constrains the possible size of the ensemble; in general, we use the largest ensemble size (up to a maximum of $e=32$) that our hardware permits for a particular model. More results on the effect of ensemble size can be found in Figure 15. We compare ensembling with ColorShift to ensembling with other well-known augmentations in Appendix Section A.5, and observe that ColorShift is the strongest of the augmentations we tried for model inversion.
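A sketch of this ensembled objective, batching the ColorShifted views through the model in one forward pass (the ColorShift draws follow the definition in section 3.2):

```python
import torch
import torch.nn.functional as F

def ensemble_loss(model, x, target_class, ensemble_size=32, alpha=1.0, beta=1.0):
    """Average cross-entropy over an 'ensemble' of differently ColorShifted
    copies of the same image, evaluated in a single batched forward pass."""
    c = x.shape[1]
    views = []
    for _ in range(ensemble_size):
        mu = torch.zeros(1, c, 1, 1, device=x.device).uniform_(-alpha, alpha)
        sigma = torch.zeros(1, c, 1, 1, device=x.device).uniform_(-beta, beta).exp()
        views.append(sigma * x - mu)          # ColorShift(x) = sigma * x - mu
    y = torch.full((ensemble_size,), target_class, dtype=torch.long, device=x.device)
    return F.cross_entropy(model(torch.cat(views)), y)  # mean over the ensemble
```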

3.4 The Plug-in Inversion Method

We combine the jitter, ensembling, ColorShift, centering, and zoom techniques, and name the result Plug-In Inversion, which references the ability to ‘plug in’ any differentiable model, including ViTs and MLPs, using a single fixed set of hyper-parameters. Full pseudocode for the algorithm may be found in appendix E. In the next section, we detail the experimental method that we used to find these hyper-parameters, after which we present our main results.

Figure 6: Images inverted from the ImageNet Volcano class for various Convolutional, Transformer, and MLP-based networks using PII (ResNet-101, ViT B-32, DeiT P16 224, DeiT Dist P16 384, ConViT tiny, Mixer b16 224, PiT Dist 224, ResMLP 36 Dist, Swin P4 W12, Twin PCPVT). See Figure 17 for further examples. For more details about networks, refer to Appendix B.
Figure 7: Inverting different ImageNet model and class combinations using PII (classes: Barn, Garbage Truck, Goblet, Ocean Liner, CRT Screen, Warplane; models: ResNet-101, ViT B-32, DeiT Dist, ResMLP 36).
Figure 8: Inverting different CIFAR-100 model and class combinations using PII (classes: Apple, Castle, Dolphin, Maple, Road, Rose, Sea, Seal, Train; models: ViT L-16, ViT B-32, ViT S-32, ViT T-16).
Figure 9: Inverting every class of CIFAR-10 (Plane, Car, Bird, Cat, Deer, Dog, Frog, Horse, Ship, Truck) from different ViT models (ViT L-32, ViT L-16, ViT B-32, ViT B-16) using PII.

4 Experimental Setup

In order to tune hyper-parameters of PII for use on naturally-trained models, we use the torchvision (Paszke et al., 2019) ImageNet-trained ResNet-50 model. We apply centering + zoom simultaneously in 7 ‘stages’. During each stage, we optimize the selected patch for 400 iterations, applying random jitter and ColorShift at each step. We use the Adam (Kingma & Ba, 2014) optimizer with momentum $\beta_m=(0.5,0.99)$, initial learning rate $lr=0.01$, and cosine decay. At the beginning of every stage, the learning rate and optimizer are re-initialized. We use $\alpha=\beta=1.0$ for the ColorShift parameters, and an ensemble size of $e=32$. Further ablation studies for these choices can be found in Figures 11, 14, and 15.
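For clarity, the per-stage optimizer setup described above corresponds to the following sketch (re-created at the start of every stage):

```python
import torch

def make_stage_optimizer(x, iterations=400, lr=0.01):
    """Per-stage optimizer setup: Adam with betas (0.5, 0.99), lr 0.01, and
    cosine decay over the 400 iterations of the stage."""
    optimizer = torch.optim.Adam([x], lr=lr, betas=(0.5, 0.99))
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=iterations)
    return optimizer, scheduler
```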

All the models (including pre-trained weights) we consider in this work are publicly available from widely-used sources. Explicit details of model resources can be found in section B of the appendix. We also make the code used for all demonstrations and experiments in this work available at https://github.com/youranonymousefriend/plugininversion.

5 Results

5.1 PII works on a range of architectures

We now present the results of applying Plug-In Inversion to different types of models. We once again emphasize that we use identical settings for the PII parameters in all cases.

Figure 6 depicts images produced by inverting the Volcano class for a variety of architectures, including examples of CNNs, ViTs, and MLPs. While the quality of images varies somewhat between networks, all of them include distinguishable and well-placed visual information. Many more examples are found in Figure 17 of the Appendix.

In Figure 7, we show images produced by PII from representatives of each main type of architecture for a few arbitrary ImageNet classes. We note the distinct visual styles that appear in each row, which supports the perspective of model inversion as a tool for understanding what kind of information different networks learn during training.

Figure 10: PII and DeepInversion results for a naturally-trained ResNet-50 (classes: Gown, Microphone, Mobile Home, Schooner, Cardoon, Volcano). The middle row represents performing PII and using the result as an initialization for DeepInversion.

5.2 PII works on other datasets

In Figure 8, we use PII to invert ViT models trained on ImageNet and fine-tuned on CIFAR-100. Figure 9 shows inversion results from models fine-tuned on CIFAR-10. We emphasize that these were produced using identical settings to the ImageNet results above, whereas other methods (like DeepInversion) tune dataset-specific hyperparameters.

5.3 Comparing PII to existing methods

To quantitatively evaluate our method, we invert both a pre-trained ViT model and a pre-trained ResMLP model to produce one image per class using PII, and do the same using DeepDream (i.e., DeepInversion minus feature regularization, which is not available for these models). We then use a variety of pre-trained models to classify these images. Table 1 contains the mean top-1 and top-5 classification accuracies across these models, as well as Inception scores, for the generated images from each method. We see that our method is competitive with, and in the ViT case widely outperforms, DeepDream. Appendix G contains more details about these experiments.

Table 1: Inception score and mean classification accuracies of various models on images inverted from (a) ViT B-32 and (b) ResMLP 36 by PII and DeepDream. Higher is better in all fields.

(a) Images inverted from ViT B-32
Method       Inception score    Top-1    Top-5
PII          28.17 ± 7.21       77.0%    89.5%
DeepDream     2.72 ± 0.23       35.2%    49.6%

(b) Images inverted from ResMLP 36
Method       Inception score    Top-1    Top-5
PII           6.79 ± 2.18       49.2%    62.0%
DeepDream     3.27 ± 0.47       51.3%    61.3%

Figure 10 shows images from a few arbitrary classes produced by PII and DeepInversion. We additionally show images produced by DeepInversion using the output of PII, rather than random noise, as its initialization. Using either initialization, DeepInversion clearly produces images with natural-looking colors and textures, which PII of course does not. However, DeepInversion alone results in some images that either do not clearly correspond to the target class or are semantically confusing. By comparison, PII again produces images with strong spatial and semantic qualities. Interestingly, these qualities appear to be largely retained when applying DeepInversion after PII, but with the color and texture improvements that image priors afford (Mahendran & Vedaldi, 2015), suggesting that using these methods in tandem may be a way to produce even better inverted images from CNNs than either method independently.

Appendix F contains additional qualitative comparisons to DeepDream and DeepInversion, further illustrating the need for model-specific hyperparameter tuning in contrast to our method.

6 Conclusion

We studied the effect of various augmentations on the quality of class-inverted images and introduced Plug-In Inversion, which uses these augmentations in tandem. We showed that this technique produces intelligible images from a wide range of well-studied architectures and datasets, including the recently introduced ViTs and MLPs, without a need for model-specific hyper-parameter tuning. We believe that augmentation-based model inversion is a promising direction for future research in understanding computer vision models.

References

  • Chen et al. (2020) Chen, X., Fan, H., Girshick, R., and He, K. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
  • Chu et al. (2021) Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., and Shen, C. Twins: Revisiting the design of spatial attention in vision transformers. arXiv preprint arXiv:2104.13840, 1(2):3, 2021.
  • d’Ascoli et al. (2021) d’Ascoli, S., Touvron, H., Leavitt, M., Morcos, A., Biroli, G., and Sagun, L. Convit: Improving vision transformers with soft convolutional inductive biases. arXiv preprint arXiv:2103.10697, 2021.
  • Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp.  248–255. Ieee, 2009.
  • Dosovitskiy & Brox (2016) Dosovitskiy, A. and Brox, T. Inverting visual representations with convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  4829–4837, 2016.
  • Dosovitskiy et al. (2021) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
  • Fredrikson et al. (2015) Fredrikson, M., Jha, S., and Ristenpart, T. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, pp.  1322–1333, 2015.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Heo et al. (2021) Heo, B., Yun, S., Han, D., Chun, S., Choe, J., and Oh, S. J. Rethinking spatial dimensions of vision transformers. arXiv preprint arXiv:2103.16302, 2021.
  • Howard et al. (2019) Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  1314–1324, 2019.
  • Huang et al. (2017) Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  4700–4708, 2017.
  • Iandola et al. (2016) Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., and Keutzer, K. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
  • Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456. PMLR, 2015.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012.
  • LeCun et al. (1989) LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
  • Liu et al. (2021a) Liu, H., Dai, Z., So, D. R., and Le, Q. V. Pay attention to mlps. arXiv preprint arXiv:2105.08050, 2021a.
  • Liu et al. (2021b) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021b.
  • Ma et al. (2018) Ma, N., Zhang, X., Zheng, H.-T., and Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pp.  116–131, 2018.
  • Mahendran & Vedaldi (2015) Mahendran, A. and Vedaldi, A. Understanding deep image representations by inverting them. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  5188–5196, 2015.
  • Mejia et al. (2019) Mejia, F. A., Gamble, P., Hampel-Arias, Z., Lomnitz, M., Lopatina, N., Tindall, L., and Barrios, M. A. Robust or private? adversarial training makes models more vulnerable to privacy attacks. arXiv preprint arXiv:1906.06449, 2019.
  • Mordvintsev et al. (2015) Mordvintsev, A., Olah, C., and Tyka, M. Inceptionism: Going deeper into neural networks. 2015.
  • Olah et al. (2017) Olah, C., Mordvintsev, A., and Schubert, L. Feature visualization. Distill, 2017. doi: 10.23915/distill.00007. https://distill.pub/2017/feature-visualization.
  • Opitz & Maclin (1999) Opitz, D. and Maclin, R. Popular ensemble methods: An empirical study. Journal of artificial intelligence research, 11:169–198, 1999.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32:8026–8037, 2019.
  • Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. Advances in neural information processing systems, 29:2234–2242, 2016.
  • Sandler et al. (2018) Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  4510–4520, 2018.
  • Santurkar et al. (2019) Santurkar, S., Ilyas, A., Tsipras, D., Engstrom, L., Tran, B., and Madry, A. Image synthesis with a single (robust) classifier. Advances in Neural Information Processing Systems, 32:1262–1273, 2019.
  • Shafahi et al. (2019) Shafahi, A., Najibi, M., Ghiasi, A., Xu, Z., Dickerson, J., Studer, C., Davis, L. S., Taylor, G., and Goldstein, T. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.
  • Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Simonyan et al. (2014) Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps, 2014.
  • Souri et al. (2021) Souri, H., Goldblum, M., Fowl, L., Chellappa, R., and Goldstein, T. Sleeper agent: Scalable hidden trigger backdoors for neural networks trained from scratch. arXiv preprint arXiv:2106.08970, 2021.
  • Szegedy et al. (2015) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  1–9, 2015.
  • Tan et al. (2019) Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., and Le, Q. V. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  2820–2828, 2019.
  • Tolstikhin et al. (2021) Tolstikhin, I., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Keysers, D., Uszkoreit, J., Lucic, M., et al. Mlp-mixer: An all-mlp architecture for vision. arXiv preprint arXiv:2105.01601, 2021.
  • Touvron et al. (2021a) Touvron, H., Bojanowski, P., Caron, M., Cord, M., El-Nouby, A., Grave, E., Izacard, G., Joulin, A., Synnaeve, G., Verbeek, J., and Jégou, H. Resmlp: Feedforward networks for image classification with data-efficient training, 2021a.
  • Touvron et al. (2021b) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jegou, H. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, volume 139, pp.  10347–10357, July 2021b.
  • Touvron et al. (2021c) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pp. 10347–10357. PMLR, 2021c.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
  • Wightman (2019) Wightman, R. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
  • Xie et al. (2017) Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  1492–1500, 2017.
  • Xu et al. (2021) Xu, W., Xu, Y., Chang, T., and Tu, Z. Co-scale conv-attentional image transformers. arXiv preprint arXiv:2104.06399, 2021.
  • Yin et al. (2020) Yin, H., Molchanov, P., Alvarez, J. M., Li, Z., Mallya, A., Hoiem, D., Jha, N. K., and Kautz, J. Dreaming to distill: Data-free knowledge transfer via deepinversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  8715–8724, 2020.
  • Zagoruyko & Komodakis (2016) Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • Zeiler & Fergus (2014) Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European conference on computer vision, pp.  818–833. Springer, 2014.

Appendix A Additional Results

A.1 Ablation Study for $\alpha$ and $\beta$

Figure 11 shows the results of PII when varying the values of $\alpha$ and $\beta$, which determine the intervals from which the ColorShift constants are randomly drawn. Based on this and similar experiments, we permanently fix these parameters to $\alpha=\beta=1.0$ for all other PII experiments, and find that these values indeed transfer well to other models.

Figure 11: Effect of $\alpha$ and $\beta$ (each varied over 0, 0.1, 0.5, 1.0, 2.0, 4.0, and 8.0) on the quality of the images generated by PII from a naturally-trained ResNet-50 for the Tench class.

A.2 Insensitivity to TV regularization

Figure 12 shows additional results on the effect of ColorShift on the sensitivity to the weight of TV regularization when inverting a robust model, complementing Figure 3. As in the earlier figure, we observe that certain values of $\lambda_{TV}$ may produce noisy or blurred images when not using ColorShift, whereas the ColorShift results are quite stable.

Figure 12: Effect of the TV regularization weight ($\log(\lambda_{TV})$ from $-9$ to $-4$) with and without ColorShift for several classes of the robust ResNet-50. With ColorShift, there is clearly no need to tune hyper-parameters such as $\lambda_{TV}$.

A.3 Effect of Centering

Figures 13 and 14 show the effect of centering on inverting a robust and natural model, respectively.

Figure 13: Effect of using centering vs. not using centering for a robust ResNet-50 (classes: Gold Finch, Box Turtle, Harvestman, Black Widow, Black Grouse, Mergus Serrator, Border Terrier, Tiger Beetle, Cricket, Upright Piano, Windsor Tie, Volcano).
Figure 14: Effect of using centering vs. not using centering for a naturally-trained ResNet-50 (classes: Bustard, Otterhound, Fly, Macaque, Clog, Combination Lock, Coffeepot, Espresso Maker, Shower Curtain, TV, Iron, Mower).

A.4 Effect of Ensemble Size

Figure 15 gives additional results, beyond those in Figure 4, on the effect of ensemble size on inversion.

Figure 15: Effect of ensemble size ($e=1$ through $e=64$) when inverting a robust ResNet-50 (ImageNet classes 403, 283, 449, 460, 558, 802, and 834). Even small values of $e$ give reasonably good results, but increasing $e$ tends to give slight improvement.

A.5 Effect of using other Augmentations

We compared ColorShift against four other random augmentations, adapted with modifications from those used in (Chen et al., 2020); we describe them using PyTorch (Paszke et al., 2019) notation. We used RandomHorizontalFlip with probability 0.5, RandomResizedCrop with scale [0.7, 1.0] and ratio [0.75, 1.33], ColorJitter applied with probability 0.8 and brightness, contrast, saturation, and hue of (0.4, 0.4, 0.4, 0.1), respectively, and RandomGrayscale with probability 0.2. For this experiment, we do apply data normalization before feeding the input to the network, which differs from the regular setting we use for the robust model (see Appendix C). The reason is that omitting data normalization acts similarly to ColorShift, in that it changes the data distribution that the model expects as input.
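In torchvision code, these comparison augmentations correspond roughly to the following sketch (the crop size of 224 is an assumed ImageNet-scale input resolution):

```python
import torchvision.transforms as T

# Comparison augmentations described above, in torchvision notation.
comparison_augmentations = {
    'Flip': T.RandomHorizontalFlip(p=0.5),
    'Crop': T.RandomResizedCrop(224, scale=(0.7, 1.0), ratio=(0.75, 1.33)),
    'Color Jitter': T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    'Gray': T.RandomGrayscale(p=0.2),
}
```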

Figure 16: Effect of using different augmentations (None, Flip, Crop, Gray, Color Jitter, ColorShift) on inverting a robustly-trained ResNet-50 (top 3 rows; target classes 140, 295, 350) and a naturally-trained ResNet-50 (bottom 3 rows; target classes 240, 400, 460).

A.6 PII on additional networks

Figure 17: PII applied to various vision models for the Volcano class, spanning CNNs (AlexNet, DenseNet, GoogLeNet, MobileNet, MNasNet, ResNet, ResNext, Wide ResNet, ShuffleNet, SqueezeNet, and VGG variants), Transformers (ViT, DeiT, CoaT, ConViT, PiT, Swin, and Twins variants), and MLP-based models (Mixer and ResMLP variants). See Appendix B for the full list of models.

Figure 17 shows the results of Plug-In Inversion on various CNN, ViT, and MLP networks, adding to those shown in figure 6. See section B for model details.

Appendix B Models

In our experiments, we use publicly available pre-trained models from various sources. The following tables list the models used from each source, along with references to where they are introduced in the literature.

Alias Name Paper
ViT B16 B_16_imagenet1k (Dosovitskiy et al., 2021)
ViT B32 B_32_imagenet1k (Dosovitskiy et al., 2021)
ViT B-32 B_32_imagenet1k (Dosovitskiy et al., 2021)
ViT L16 L_16_imagenet1k (Dosovitskiy et al., 2021)
ViT L32 L_32_imagenet1k (Dosovitskiy et al., 2021)
Figure 18: Pre-trained models used from: https://github.com/lukemelas/PyTorch-Pretrained-ViT.
Alias Name Paper
DeiT p16-224 deit_base_patch16_224 (Touvron et al., 2021c)
DeiT P16 224 deit_base_patch16_224 (Touvron et al., 2021c)
Deit-D p16-384 deit_base_distilled_patch16_384 (Touvron et al., 2021c)
Deit Dist P16 384 deit_base_distilled_patch16_384 (Touvron et al., 2021c)
deit p16-384 deit_base_patch16_384 (Touvron et al., 2021c)
Deit-D-t p16-224 deit_tiny_distilled_patch16_224 (Touvron et al., 2021c)
Deit-D-s p16-224 deit_small_distilled_patch16_224 (Touvron et al., 2021c)
Deit-D p16-224 deit_base_distilled_patch16_224 (Touvron et al., 2021c)
Figure 19: Pre-trained models from (Touvron et al., 2021b).
Alias Name Paper
AlexNet alexnet (Krizhevsky et al., 2012)
DenseNet densenet121 (Huang et al., 2017)
GoogLeNet googlenet (Szegedy et al., 2015)
MobileNet v2 mobilenet_v2 (Sandler et al., 2018)
MobileNet-v2 mobilenet_v2 (Sandler et al., 2018)
MobileNet v3-l mobilenet_v3_large (Howard et al., 2019)
MobileNet v3-s mobilenet_v3_small (Howard et al., 2019)
MNasNet 0-5 mnasnet0_5 (Tan et al., 2019)
MNasNet 1-0 mnasnet1_0 (Tan et al., 2019)
ResNet 18 resnet18 (He et al., 2016)
ResNet-18 resnet18 (He et al., 2016)
ResNet 34 resnet34 (He et al., 2016)
ResNet 50 resnet50 (He et al., 2016)
ResNet 101 resnet101 (He et al., 2016)
ResNet-101 resnet101 (He et al., 2016)
ResNet 152 resnet152 (He et al., 2016)
ResNext 50 resnext50_32x4d (Xie et al., 2017)
ResNext 101 resnext101_32x8d (Xie et al., 2017)
WResNet 50 wide_resnet50_2 (Zagoruyko & Komodakis, 2016)
WResNet 101 wide_resnet101_2 (Zagoruyko & Komodakis, 2016)
W-ResNet-101-2 wide_resnet101_2 (Zagoruyko & Komodakis, 2016)
ShuffleNet v2-0-5 shufflenet_v2_x0_5 (Ma et al., 2018)
ShuffleNet v2-1-0 shufflenet_v2_x1_0 (Ma et al., 2018)
ShuffleNet v2 shufflenet_v2_x1_0 (Ma et al., 2018)
SqueezeNet squeezenet1_0 (Iandola et al., 2016)
VGG11-bn vgg11_bn (Simonyan & Zisserman, 2014)
VGG13-bn vgg13_bn (Simonyan & Zisserman, 2014)
VGG16-bn vgg16_bn (Simonyan & Zisserman, 2014)
VGG19-bn vgg19_bn (Simonyan & Zisserman, 2014)
Figure 20: Pre-trained models from TorchVision: https://github.com/pytorch/vision.
Alias Name Paper
CoaT-m coat_lite_mini (Xu et al., 2021)
CoaT-s coat_lite_small (Xu et al., 2021)
CoaT-t coat_lite_tiny (Xu et al., 2021)
ConViT convit_base (d’Ascoli et al., 2021)
ConViT-s convit_small (d’Ascoli et al., 2021)
ConViT-t convit_tiny (d’Ascoli et al., 2021)
ConViT tiny convit_tiny (d’Ascoli et al., 2021)
Mixer 24-224 mixer_24_224 (Tolstikhin et al., 2021)
Mixer b16-224 mixer_b16_224 (Tolstikhin et al., 2021)
Mixer b16 224 mixer_b16_224 (Tolstikhin et al., 2021)
Mixer b16-224-mill mixer_b16_224_miil (Tolstikhin et al., 2021)
Mixer l16-224 mixer_l16_224 (Tolstikhin et al., 2021)
PiT-D b-224 pit_b_distilled_224 (Heo et al., 2021)
PiT Dist 224 pit_b_distilled_224 (Heo et al., 2021)
PiT s-224 pit_s_224 (Heo et al., 2021)
PiT-D s-224 pit_s_distilled_224 (Heo et al., 2021)
PiT-D t-224 pit_ti_distilled_224 (Heo et al., 2021)
ResMLP 12-224 resmlp_12_224 (Touvron et al., 2021a)
ResMLP-D 12-224 resmlp_12_distilled_224 (Touvron et al., 2021a)
ResMLP 24-224 resmlp_24_224 (Touvron et al., 2021a)
ResMLP-D 24-224 resmlp_24_distilled_224 (Touvron et al., 2021a)
ResMLP 36-224 resmlp_36_224 (Touvron et al., 2021a)
ResMLP-D 36-224 resmlp_36_distilled_224 (Touvron et al., 2021a)
ResMLP 36 Dist resmlp_36_distilled_224 (Touvron et al., 2021a)
ResMLP b-24-224 resmlp_big_24_224 (Touvron et al., 2021a)
ResMLP b-24-224-1k resmlp_big_24_224_in22ft1k (Touvron et al., 2021a)
ResMLP-D b-24-224 resmlp_big_24_distilled_224 (Touvron et al., 2021a)
Swin w7-224 swin_base_patch4_window7_224 (Liu et al., 2021b)
Swin l-w7-224 swin_large_patch4_window7_224 (Liu et al., 2021b)
Swin l-w12-384 swin_large_patch4_window12_384 (Liu et al., 2021b)
Swin w12-384 swin_base_patch4_window12_384 (Liu et al., 2021b)
Swin P4 W12 swin_base_patch4_window12_384 (Liu et al., 2021b)
Swin s-w7-224 swin_small_patch4_window7_224 (Liu et al., 2021b)
Swin t-w7-224 swin_tiny_patch4_window7_224 (Liu et al., 2021b)
Twin pcpvt-b twins_pcpvt_base (Chu et al., 2021)
Twin PCPVT twins_pcpvt_base (Chu et al., 2021)
Twins pcpvt-l twins_pcpvt_large (Chu et al., 2021)
Twins pcpvt-s twins_pcpvt_small (Chu et al., 2021)
Twins svt-b twins_svt_base (Chu et al., 2021)
Twins svt-l twins_svt_large (Chu et al., 2021)
Twins svt-s twins_svt_small (Chu et al., 2021)
Figure 21: Pre-trained models used from (Wightman, 2019).

Appendix C Additional experimental setting

C.1 Robust models

We use a robust ResNet-50 (He et al., 2016) model adversarially trained with the ‘free’ training method (Shafahi et al., 2019) on the ImageNet dataset (Deng et al., 2009). The setting we use for inverting robust models is very similar to that of PII explained in section 4, with a few differences. Throughout the paper, we use centering for robust models unless otherwise mentioned (e.g., when we are examining the effect of zoom and centering themselves). We scale the total variation term in the loss function by 0.0005. Also, we do not apply the data normalization layer before feeding the input to the network. In the PII experimental setting, we apply a random ColorShift at each optimization step to each element in the ensemble. In the robust setting, we do not update the ColorShift variables $\mu$ and $\sigma$ while the patch size is fixed; we only redraw these variables for the ensemble when we move to a new patch size. Although using ColorShift alleviates the need for TV regularization, as discussed in section 3.2 and illustrated in Figure 3, we retain the TV penalty in our robust setting to make this setting more similar to that of previous inversion methods and to emphasize that it is a toy example for our ablation studies.

Appendix D Every Class of ImageNet Dataset Inverted

Figure 22: Inversion of first 500 classes of ImageNet for the Robust Model.

Figure 23: Inversion of second 500 classes of ImageNet for the Robust Model.

Appendix E Optimization algorithm

Algorithm 1 Optimization procedure for Plug-In Inversion

Input: Model $f$, class $y$, final resolution $R$, ColorShift parameters $\alpha, \beta$, ‘ensemble’ size $e$, randomly initialized $\mathbf{x} \in \mathcal{I}^{3 \times R/8 \times R/8}$

for $s = 1, \dots, 7$ do
    Upsample $\mathbf{x}$ to resolution $\frac{(2s+1)R}{16} \times \frac{(2s+1)R}{16}$
    Pad $\mathbf{x}$ with random noise to resolution $\frac{(s+1)R}{8} \times \frac{(s+1)R}{8}$
    for $i = 1, \dots, 400$ do
        $\mathbf{x}' = \text{Jitter}(\mathbf{x})$
        for $n = 1, \dots, e$ do
            Draw $\mu \sim U(-\alpha, \alpha)^{3}$, $\sigma \sim \exp(U(-\beta, \beta))^{3}$
            $\mathbf{x}_n = \text{ColorShift}_{\mu,\sigma}(\mathbf{x}')$
        $\mathcal{L} = \frac{1}{e}\sum_{n=1}^{e} \text{NLL}(f(\mathbf{x}_n), y)$
        $\mathbf{x} \leftarrow \text{Adam}_i(\mathbf{x}, \nabla_{\mathbf{x}}\mathcal{L})$
return $\mathbf{x}$
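For concreteness, a minimal PyTorch translation of Algorithm 1 might look as follows; it is a sketch rather than our exact implementation, omitting device placement and data normalization, and with the jitter magnitude chosen illustratively:

```python
import random
import torch
import torch.nn.functional as F

def plug_in_inversion(model, target_class, final_res=224, alpha=1.0, beta=1.0,
                      ensemble_size=32, iters_per_stage=400, lr=0.01):
    """Sketch of Algorithm 1: 7 zoom + centering stages, each optimized with
    jitter and an ensemble of ColorShifted views of the same image."""
    model.eval()
    x = torch.randn(1, 3, final_res // 8, final_res // 8, requires_grad=True)
    y = torch.full((ensemble_size,), target_class, dtype=torch.long)

    for stage in range(1, 8):
        mid_res = (2 * stage + 1) * final_res // 16
        cur_res = (stage + 1) * final_res // 8
        # Zoom: up-sample the previous patch; Centering: pad it with noise.
        x_up = F.interpolate(x.detach(), size=(mid_res, mid_res),
                             mode='bilinear', align_corners=False)
        canvas = torch.randn(1, 3, cur_res, cur_res)
        off = (cur_res - mid_res) // 2
        canvas[:, :, off:off + mid_res, off:off + mid_res] = x_up
        x = canvas.requires_grad_(True)

        # Optimizer and cosine schedule are re-initialized every stage.
        opt = torch.optim.Adam([x], lr=lr, betas=(0.5, 0.99))
        sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=iters_per_stage)

        for _ in range(iters_per_stage):
            shift = cur_res // 8                      # illustrative jitter magnitude
            x_j = torch.roll(x, shifts=(random.randint(-shift, shift),
                                        random.randint(-shift, shift)), dims=(2, 3))
            views = []
            for _ in range(ensemble_size):            # ensemble of ColorShifted views
                mu = torch.zeros(1, 3, 1, 1).uniform_(-alpha, alpha)
                sigma = torch.zeros(1, 3, 1, 1).uniform_(-beta, beta).exp()
                views.append(sigma * x_j - mu)
            loss = F.cross_entropy(model(torch.cat(views)), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
            sched.step()

    return x.detach()
```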

Appendix F Additional baseline comparisons

Figure 24: Images inverted from the Volcano class for various Convolutional, Transformer, and MLP-based networks (MobileNet-v2, ResNet-18, VGG16-bn, W-ResNet-101-2, ShuffleNet-v2, ResNet-101, ViT B-32, DeiT P16 224, DeiT Dist P16 384, ConViT tiny, Mixer b16 224, PiT Dist 224, ResMLP 36 Dist, Swin P4 W12, Twin PCPVT) using DeepInversion (CNN models) / DeepDream (non-CNN models). Cross-reference Figure 6.
Figure 25: Inverting different model and class combinations (classes: Barn, Garbage Truck, Goblet, Ocean Liner, CRT Screen, Warplane; models: ResNet-101, ViT B-32, DeiT Dist, ResMLP 36) using DeepInversion (top row) / DeepDream (other rows). Cross-reference Figure 7.

Appendix G Quantitative Results

To quantitatively evaluate our method, we invert a pre-trained ViT model to produce one image per class using PII, and do the same using DeepDream (i.e., DeepInversion minus feature regularization, which is not available for this model). We then use a variety of pre-trained CNN, ViT, and MLP models to classify these images. We find that every model achieves strictly higher top-1 and top-5 accuracy on the PII-generated image set (excepting the ‘teacher’ model, which perfectly classifies both). We compile these results in Figure 26. Additionally, we compute the Inception score (Salimans et al., 2016) for both sets of images, which also favors PII over DeepDream, with scores of $28.17\pm 7.21$ and $2.72\pm 0.23$, respectively.

Figure 26: Top-1 (a) and top-5 (b) classification accuracy of various CNN, ViT, and MLP models evaluated on images generated from ViT B-32 using PII and DeepDream.

We also perform the same evaluation for images generated from a pre-trained ResMLP model. These results are more mixed; DeepDream images are classified much better by a small number of models, but the majority of models classify PII images better, and the average accuracy across models is approximately equal for both methods. Inception score, however, once again clearly favors PII over DeepDream, with scores of $6.79\pm 2.18$ and $3.27\pm 0.47$, respectively.

Figure 27: Top-1 (a) and top-5 (b) classification accuracy of various CNN, ViT, and MLP models evaluated on images generated from ResMLP 36-224 using PII and DeepDream.