
Mini-batch graphs for robust image classification

Arnab Kumar Mondal*
McGill University
[email protected]

Vineet Jain*
Carnegie Mellon University
[email protected]

Kaleem Siddiqi
McGill University
[email protected]

* Equal contribution.
Abstract

Current deep learning models for classification tasks in computer vision are trained using mini-batches. In the present article, we take advantage of the relationships between samples in a mini-batch, using graph neural networks to aggregate information from similar images. This helps mitigate the adverse effects of alterations to the input images on classification performance. Diverse experiments on image-based object and scene classification show that this approach not only improves a classifier’s performance but also increases its robustness to image perturbations and adversarial attacks. Further, we also show that mini-batch graph neural networks can help to alleviate the problem of mode collapse in Generative Adversarial Networks.

1 Introduction

In recent years, supervised deep learning has had wide success in computer vision problems, including object and scene classification [17, 32] and image segmentation [2, 13]. Models such as residual networks [10] have become standard and are now used as encoders to learn image-based representations. In most of these settings, the training data is divided into mini-batches to adjust to limitations in computational and memory resources. Within a particular mini-batch, the input images may have varying degrees of similarity between them. Exploiting this variability during the feature encoding stage has the potential to improve the performance of downstream computer vision tasks.

Motivated by this idea, different approaches have been proposed to take advantage of the relationships between samples in a mini-batch for computer vision tasks, and in particular for image-based classification. These approaches all explicitly encourage the embeddings in the feature space to be close to one another when the underlying images are similar, by using an extra similarity-based loss term. As an example, in [4] contrastive learning is used so that different augmentations of the same image within a mini-batch have a high degree of pair-wise similarity. In [15] this approach is extended to a supervised setting, such that the embeddings for instances within the same class are nearby. In a similar vein, in [27] the learning of representations is supervised by increasing the affinity between mini-batch samples that belong to the same class.

In the present article, we propose a more direct approach to information aggregation across each mini-batch of images, which is to use graph neural networks (GNNs) for this purpose. Our approach is based on the construction of a graph from each mini-batch of samples, which allows information to be pooled across those with similar features, using graph convolution operations. As such, the requirement that similar images have similar embeddings is implicit, in that no additional loss term has to be optimized. This allows for the aggregation of features during training in a manner that adjusts dynamically to each particular mini-batch ensemble of images. A perturbation analysis explains how this, in turn, affords a degree of robustness to input image alterations and adversarial attacks. We also show the connection of our mini-batch graph neural networks to mini-batch discrimination, a technique to mitigate the mode collapse problem faced during training of Generative Adversarial Networks (GANs) [7].

Our experiments show a consistent improvement over the baseline and other related approaches across multiple architectures and datasets for the particular task of image classification. They also reveal the robustness of our proposed model against input perturbations and adversarial attacks. Our proposed mini-batch graph learning method causes little computational overhead since it introduces only a small number of additional parameters for feature aggregation. Since the method is implemented as a modular layer (Figure 1) and training in mini-batches is not specific to image classification, with minor modifications, it can be used for other vision tasks including segmentation [2, 13], region proposal generation [20] and relationship modeling [31].

2 Related work

The modeling of relationships between samples in a mini-batch has already shown promise in computer vision tasks. In recent work [27], the learning of affinity between samples is supervised by optimizing an affinity mass loss. Here the pairwise affinity between all samples in the mini-batch is considered, and the loss function encourages the model to increase the affinity between samples belonging to the same class.

A different approach to exploit relationships between samples while training in mini-batches is to use supervised contrastive learning [15]. Here, the normalized embeddings from samples in the same class are encouraged to be closer to one another than to the embeddings of samples from different classes. This approach is related to another self-supervised contrastive method [4], where a model is trained to identify samples in a mini-batch that are different augmentations of the same image. These methods demonstrate improvement in image classification performance over standard networks.

Motivated by the success of the above methods, in the present article we propose and evaluate a more flexible approach, which is to use graph neural networks (GNNs) during mini-batch training to aggregate features from similar samples. Distinct from affinity graphs [27] and contrastive learning [4, 15], the use of GNNs in our approach encourages similar images to have similar embeddings in an implicit manner, removing the need to optimize an additional loss term. A GNN learns node representations that reflect the local neighborhood by aggregating information across nodes.

GNNs were first proposed as deep learning architectures on graph-structured data in [8, 25] and have since been extended to include convolution operations on graphs [3, 11, 5] or to combine locally connected regions in graphs [19]. The use of GNNs for semi-supervised classification was proposed in [16], following which several different variants of GNN models have been developed: Graph Attention Networks (GATs) [26], models to process graphs with edge information [6, 29], and GNNs that work under low homophily [33].

GNNs have also been successfully applied to several other computer vision problems, including few-shot learning [24, 22] and semi-supervised learning [35]. In [24], a GNN is used to propagate label information from labeled samples in the support set towards the unlabeled query image. In contrast, in [22, 35], a fixed graph is used to propagate embedding and label information, respectively.

Figure 1: An illustration of the proposed MBGNN model for mini-batch learning for image classification. Encoder representations are used to create an adjacency matrix based on k-nearest neighbours. The adjacency matrix defines a graph on a mini-batch of representations, which is then processed using graph convolution operations, as detailed in Section 3.

3 Mini-batch Graph Neural Networks

3.1 Proposed Method

Our network has two components: a feature encoder and a mini-batch graph neural network (MBGNN), as illustrated in Fig. 1. We obtain the encoder $f_{\theta}(\cdot)$ by removing the final layer of a standard residual network, such as ResNet-50. Consider a typical training setup which takes a mini-batch of samples as input for a downstream classification task. We denote the input samples in a mini-batch as $\mathcal{X}=\{x_{1},x_{2},\ldots,x_{B}\}$, where $B$ is the batch size. The encoder provides a representation for each sample, $h_{i}^{(0)}=f_{\theta}(x_{i}),\ \forall i\in B$. The MBGNN induces a graph on the set of encoded representations and processes them using graph convolution operations. We denote the representations in the $l^{\text{th}}$ layer by the set $\mathcal{H}^{(l)}=\{h_{1}^{(l)},h_{2}^{(l)},\ldots,h_{B}^{(l)}\}$. To dynamically induce a graph on $\mathcal{H}^{(l)}$, we obtain the adjacency matrix $A^{(l)}$ by computing the pairwise cosine similarity between representations and taking the top $k$ most similar representations for each sample as its neighbours, removing self-connections. The extent of the neighbourhood for each node can be controlled with the parameter $k$. The layer-wise propagation rule of the MBGNN, in vector form, is given by

\bar{H}^{(l)} = H^{(l)} W^{(l)} + b^{(l)},
H^{(l+1)} = \sigma\big(\text{combine}\big(\bar{H}^{(l)},\ (1/k)\, A^{(l)} \bar{H}^{(l)}\big)\big),    (1)

where $h^{(l)}_{i}$ is stacked row-wise to form $H^{(l)}$, $W^{(l)}$ is the weight matrix, $b^{(l)}$ is the bias, and $\sigma(\cdot)$ is a non-linear function, usually ReLU. In equation (1), $f_{\text{self}}^{(l)}=\bar{H}^{(l)}$ contains the 'self' information for each node, and $f_{\text{neigh}}^{(l)}=(1/k)A^{(l)}\bar{H}^{(l)}$ contains the 'neighbour' information, since it is based on the average of the encoded representations of the neighbours of each node, as reflected by the adjacency matrix $A^{(l)}$.
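As a concrete illustration, the following PyTorch sketch shows how the $k$-nearest-neighbour adjacency matrix used in equation (1) might be built from pairwise cosine similarities; the function name and shapes are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def knn_adjacency(H: torch.Tensor, k: int) -> torch.Tensor:
    """Build a B x B binary adjacency matrix from pairwise cosine similarity.

    Each node is connected to its k most similar nodes, with self-connections removed.
    H is a B x d matrix of mini-batch representations.
    """
    B = H.size(0)
    H_norm = F.normalize(H, dim=1)              # unit-norm rows
    sim = H_norm @ H_norm.t()                   # pairwise cosine similarity, B x B
    sim.fill_diagonal_(float('-inf'))           # exclude self-connections
    idx = sim.topk(k, dim=1).indices            # top-k neighbours per node
    A = torch.zeros(B, B, device=H.device)
    A.scatter_(1, idx, 1.0)
    return A
```

The resulting matrix can be plugged directly into equation (1), with $(1/k)A^{(l)}\bar{H}^{(l)}$ giving the averaged neighbour features.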

There are several methods to combine the self and neighbour information in the $\text{combine}(\cdot,\cdot)$ function, which are explained below.

Concatenation: Here the self features $f_{\text{self}}^{(l)}$ and neighbour features $f_{\text{neigh}}^{(l)}$ are concatenated along the representation dimension. This gives the network the flexibility to learn separate weights for both.

Weighted Addition: This is a convex combination of the self and neighbour features, which forces the network to use neighbour information. For $\alpha\in[0,1]$, we have

\text{combine}\big(f_{\text{self}}^{(l)}, f_{\text{neigh}}^{(l)}\big) = \alpha f_{\text{self}}^{(l)} + (1-\alpha) f_{\text{neigh}}^{(l)}.    (2)

This reduces to a standard GCN formulation if we set $\alpha=1/(k+1)$.

Drop Features: We propose a different method to mitigate the effect of co-adaptation and to make the neighbours contribute meaningful information. During training, we randomly use either the self features (with probability $p$) or the neighbour features (with probability $1-p$), dropping the other. We then make the testing phase deterministic by taking the expected output feature, which leads to an expression similar to the sum combination:

\text{combine}\big(f_{\text{self}}^{(l)}, f_{\text{neigh}}^{(l)}\big) = p f_{\text{self}}^{(l)} + (1-p) f_{\text{neigh}}^{(l)}.    (3)

We explore these different methods in our experiments.
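The three options can be summarized in a small helper. The sketch below is a hedged illustration in PyTorch (the function name, argument layout, and default values of $\alpha$ and $p$ are assumptions), with the drop-features branch following the train/test behaviour described above.

```python
import torch

def combine(f_self, f_neigh, mode="dropfeat", alpha=0.5, p=0.5, training=True):
    """Combine self and neighbour feature matrices (both B x d)."""
    if mode == "concat":
        # Output is B x 2d; the next layer learns separate weights for the two halves.
        return torch.cat([f_self, f_neigh], dim=1)
    if mode == "sum":
        # Weighted addition, equation (2).
        return alpha * f_self + (1 - alpha) * f_neigh
    if mode == "dropfeat":
        if training:
            # Per sample, keep the self features with probability p, else the neighbour features.
            use_self = torch.rand(f_self.size(0), 1, device=f_self.device) < p
            return torch.where(use_self, f_self, f_neigh)
        # Deterministic expected output at test time, equation (3).
        return p * f_self + (1 - p) * f_neigh
    raise ValueError(f"unknown combine mode: {mode}")
```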

Attention MBGNN

The model can also be made more expressive by using an attention mechanism while aggregating from neighbours, which we refer to as an Attn-MBGNN (see Fig. 1). This is done by changing the calculation of the adjacency matrix $\mathcal{A}^{(l)}$ to incorporate an attention coefficient between nodes. Let $\alpha^{n(l)}_{ij}$ denote the attention coefficient of node $j$ to node $i$ for the $n$-th attention head, which is computed as

\alpha_{ij}^{n(l)} = \frac{\exp\big(\phi\big(\psi(W_{n}^{(l)} h_{i}^{(l)}, W_{n}^{(l)} h_{j}^{(l)})\big)\big)}{\sum_{m\in\mathcal{N}(i)} \exp\big(\phi\big(\psi(W_{n}^{(l)} h_{i}^{(l)}, W_{n}^{(l)} h_{m}^{(l)})\big)\big)}.    (4)

Here $\phi(\cdot)$ is a neural network, $W_{n}^{(l)}$ is a trainable matrix, and $\psi$ is the absolute difference. This is similar to the attention coefficient used in GAT [26]. To form the weighted adjacency matrix, we first follow the same process as for the MBGNN, considering the top $k$ most similar features based on cosine similarity to obtain the neighbourhood $\mathcal{N}(i)$ of each node $i$. For the $n$-th head we then set $\mathcal{A}^{n(l)}_{ij}=\alpha^{n(l)}_{ij}$ for all $j\in\mathcal{N}(i)$ and $\mathcal{A}^{n(l)}_{ij}=0$ for all $j\notin\mathcal{N}(i)$. We also remove self-connections by setting $\mathcal{A}^{n(l)}_{ii}=0$ for all $i$. The vectorized layer-wise propagation rule of the Attn-MBGNN with $N$ attention heads then becomes

\bar{H}^{(l)} = H^{(l)} W^{(l)} + b^{(l)},
H^{(l+1)} = \sigma\Big(\text{combine}\Big(\bar{H}^{(l)},\ \frac{1}{N}\sum_{n=1}^{N} \mathcal{A}^{n(l)} \bar{H}^{(l)}\Big)\Big).    (5)
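A sketch of one attention head computing the weighted adjacency of equation (4) is given below; the two-layer network used for $\phi(\cdot)$ and its hidden width are assumptions for illustration, not the exact architecture used in our experiments.

```python
import torch
import torch.nn as nn

class AttentionHead(nn.Module):
    """One Attn-MBGNN head: attention weights over each node's k-NN neighbourhood (eq. 4)."""
    def __init__(self, in_dim, proj_dim, phi_hidden=64):
        super().__init__()
        self.W = nn.Linear(in_dim, proj_dim, bias=False)          # W_n^(l)
        self.phi = nn.Sequential(nn.Linear(proj_dim, phi_hidden), # phi(.), a small MLP
                                 nn.ReLU(),
                                 nn.Linear(phi_hidden, 1))

    def forward(self, H, knn_mask):
        # H: B x d node features; knn_mask: B x B binary k-NN adjacency (no self-connections).
        Z = self.W(H)
        psi = (Z.unsqueeze(1) - Z.unsqueeze(0)).abs()       # |W h_i - W h_j|, shape B x B x proj_dim
        scores = self.phi(psi).squeeze(-1)                  # unnormalised attention scores, B x B
        scores = scores.masked_fill(knn_mask == 0, float('-inf'))
        return torch.softmax(scores, dim=1)                 # weighted adjacency A^n, rows sum to 1
```

The $N$ head outputs are averaged and used in place of $(1/k)A^{(l)}$ in the propagation rule, as in equation (5).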

Once the intermediate representations (up to $L$ layers) are obtained using either the MBGNN or the Attn-MBGNN, we compute the final logits for the $i$-th input in the batch as

\ell_{i} = W_{\text{final}}\big(h_{i}^{(1)} \,\|\, h_{i}^{(2)} \,\|\, \ldots \,\|\, h_{i}^{(L)}\big),    (6)

where $\|$ denotes concatenation in the feature dimension. This captures the local and global information separately and takes a weighted combination. This design choice has been shown to increase the representation power of GNNs [30], by leveraging different neighbourhood ranges to better enable structure-aware representations. We can now compute the class probabilities by taking the softmax, $p(y_{i}=k\,|\,x_{i})=\frac{\exp(\ell_{k})}{\sum_{j=1}^{C}\exp(\ell_{j})}$, where $C$ is the total number of classes. We use the cross-entropy loss to train the encoder and the MBGNN model end-to-end,

\mathcal{L} = -\sum_{i=1}^{B}\sum_{k=1}^{C} \mathbb{1}_{y_{i}=k} \log p(y_{i}=k\,|\,x_{i}).    (7)
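Putting the pieces together, a minimal sketch of the classification head (using the weighted-addition combine and the knn_adjacency helper sketched earlier; the layer widths and $\alpha$ are illustrative assumptions) might look as follows.

```python
import torch
import torch.nn as nn

class MBGNNHead(nn.Module):
    """L MBGNN layers whose outputs are concatenated for the final logits (eq. 6)."""
    def __init__(self, dims, num_classes, k, alpha=0.5):
        super().__init__()
        self.k, self.alpha = k, alpha
        self.layers = nn.ModuleList([nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1)])
        self.final = nn.Linear(sum(dims[1:]), num_classes)   # W_final over h^(1) || ... || h^(L)

    def forward(self, H):                                    # H: B x dims[0] encoder features
        outs = []
        for layer in self.layers:
            A = knn_adjacency(H, self.k)                     # re-induce the k-NN graph at each layer
            H_bar = layer(H)                                 # H W + b, eq. (1)
            H = torch.relu(self.alpha * H_bar + (1 - self.alpha) * (A @ H_bar) / self.k)
            outs.append(H)
        return self.final(torch.cat(outs, dim=1))

# Example usage with hypothetical shapes: 2048-d ResNet-50 features, 10 classes, k = 16.
# logits = MBGNNHead(dims=[2048, 512], num_classes=10, k=16)(features)
# loss = nn.CrossEntropyLoss()(logits, labels)               # eq. (7)
```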
Evaluation settings

In the transductive setting, we predict the label of a single test image within a mini-batch graph consisting of that test image and multiple training images. In the inductive setting, we construct a graph on a full mini-batch of test images and predict labels for all samples. We include results under both settings.

3.2 Robustness to Input Perturbations

We now show how our MBGNN provides robustness in the face of perturbations in the input, which we have argued is a desirable property for downstream computer vision tasks.

Proposition 1.

Consider a neural network comprised of an encoder and a fully connected layer, denoted by $g_{\text{sup}}(\cdot)$, and an MBGNN network consisting of an encoder followed by a graph convolutional layer in which each node has $k$ neighbours, denoted by $g_{\text{mbgnn}}(\cdot)$. For transductive prediction, consider an input sample $x$ with some perturbation $\Delta x$. Let the associated perturbations in the logits be $\Delta y_{\text{sup}}=g_{\text{sup}}(x+\Delta x)-g_{\text{sup}}(x)$ and $\Delta y_{\text{mbgnn}}=g_{\text{mbgnn}}(x+\Delta x)-g_{\text{mbgnn}}(x)$. Then, $\Delta y_{\text{mbgnn}}=\frac{1}{k+1}\Delta y_{\text{sup}}$.

Proof.

Let the encoder $e(\cdot)$ output a vector $e(x)$ for a given image $x$, and denote a given batch of samples by $\{x_{1},x_{2},\ldots,x_{B}\}$. For the standard supervised model, denoting the weight matrix of the final layer by $W$, the perturbation in the final pre-softmax logits for a perturbation of input $x_{B}$ is $\Delta y_{\text{sup}}=W^{T}e(x_{B}+\Delta x_{B})-W^{T}e(x_{B})=W^{T}[e(x_{B}+\Delta x_{B})-e(x_{B})]$. Now consider an MBGNN model in which each sample is connected to $k$ other samples in the mini-batch and which has the same weight matrix $W$. For an MBGNN with self-connections, using the standard GCN update rule, we have $\Delta y_{\text{mbgnn}}=\frac{1}{k+1}W^{T}[e(x_{B}+\Delta x_{B})+\sum_{j\in\mathcal{N}(B)}e(x_{j})]-\frac{1}{k+1}W^{T}[e(x_{B})+\sum_{j\in\mathcal{N}(B)}e(x_{j})]=\frac{1}{k+1}W^{T}[e(x_{B}+\Delta x_{B})-e(x_{B})]=\frac{1}{k+1}\Delta y_{\text{sup}}$.

The above proposition states that for any perturbation in the input, the corresponding perturbation in the output is scaled down by a factor of $k+1$ when using an MBGNN in place of a standard network, where $k$ is the number of neighbours of each node in the mini-batch graph. Similarity-based aggregation aids transductive inference, where a prediction is made for a single corrupted test image within a mini-batch of randomly sampled, uncorrupted training set images. In Section 4.4 we carry out experiments to verify this property of robustness to image perturbations.
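The scaling in Proposition 1 can be checked numerically with a toy example. In the sketch below the perturbation is applied directly to the encoder output of one sample, and the $k$ neighbours are held fixed, which mirrors the assumptions of the proof; all shapes and values are illustrative.

```python
import torch

torch.manual_seed(0)
d, C, k = 16, 10, 4
W = torch.randn(d, C)                      # shared final-layer weights
e_x = torch.randn(d)                       # encoder output e(x) of the test sample
delta = 0.1 * torch.randn(d)               # perturbation of the encoder output
neigh_sum = torch.randn(k, d).sum(dim=0)   # sum of the k (unperturbed) neighbour encodings

# Standard supervised head: the logit change follows the perturbation directly.
dy_sup = (e_x + delta) @ W - e_x @ W

# MBGNN head with self-connections: the perturbed node is averaged with its k neighbours.
dy_mbgnn = ((e_x + delta + neigh_sum) / (k + 1)) @ W - ((e_x + neigh_sum) / (k + 1)) @ W

print(torch.allclose(dy_mbgnn, dy_sup / (k + 1), atol=1e-5))   # True
```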

We also test the robustness of the model against black-box adversarial attacks. These adversaries craft perturbations which cause the model to classify legitimate-looking input images incorrectly. Black-box adversaries do not have access to the model parameters and gradients, and must query the model to observe the output class probabilities. They query the model repeatedly with a chosen image, perturbing it at each iteration based on the results of the previous query. A model which has a lower attack success rate and/or requires a higher number of queries on average before a successful attack is considered more robust. We test the MBGNN model against two recently proposed and popular black-box adversarial attack methods, SimBA [9] and Bandits-TD [12]. We choose these two methods since they use different methodologies: SimBA uses local search to craft adversarial perturbations, whereas Bandits-TD estimates the gradient by repeatedly querying the model to create the adversarial input. The results of these experiments are also provided in Section 4.4.

3.3 Connection to Mini-batch Discrimination

Generative Adversarial Networks (GANs), first introduced in [7], are a family of generative models that are used in several computer vision tasks, including high resolution image generation [28], image super-resolution [18], domain adaptation [14, 34] and image compression [1]. GANs suffer from the problem of mode collapse, where the generated samples belong to a few modes in the dataset while still being successful at fooling the discriminator. This leads to a lack of diversity in the generated samples. One way to mitigate mode collapse in GANs is to use a technique known as mini-batch discrimination [23]. Here, instead of the discriminator being required to label individual samples as fake or real, it discriminates between an entire mini-batch of generated or real samples.

As it turns out, our proposed Attn-MBGNN can be interpreted as an extension of mini-batch discrimination to the classification task. To establish this connection, we present a variation of our MBGNN, which we refer to as a Mode Collapse MBGNN (MC-MBGNN). Rather than aggregating node features weighted by attention, as in the case of Attn-MBGNN layers, we aggregate the edge features in this model. Note that these edge features capture similarity and can be interpreted as the unnormalized attention values.

In MC-MBGNN, we modify the discriminator network by extracting features from real/fake images using the encoder $f_{\theta}(\cdot)$, and denote them by $\{h_{1},h_{2},\ldots,h_{B}\}$, where $h_{i}=f_{\theta}(x_{i})$ and $B$ is the batch size. We induce complete graphs for both the batch of generated samples and the batch of real samples, and process them separately. The output for the $n$-th aggregated edge feature of the $i$-th sample in the mini-batch, computed in a manner similar to aggregating the unnormalized attention weights of the $n$-th head in Attn-MBGNN, is given by:

\bar{h}_{i}^{n} = \sum_{j=1}^{B} \phi\big(\psi(W_{n} h_{i}, W_{n} h_{j})\big),    (8)

where $\phi(\cdot)$ is a neural network, $W_{n}$ is a trainable matrix, and $\psi$ can be either concatenation or absolute difference. We then concatenate the aggregated edge features with the independent node features and use a final layer to obtain the $i$-th scalar output, $o_{i}=\sigma\big(W_{\text{final}}(h_{i}\,\|\,\bar{h}_{i}^{1}\,\|\,\ldots\,\|\,\bar{h}_{i}^{N})\big)$, where $\sigma(\cdot)$ is the sigmoid function. If we restrict $\psi(\cdot)$ to the absolute difference, fix $\phi(\cdot)$ to the function $\exp(-x)$, and take all the $\bar{h}_{i}^{n}$ to be scalars, then the model reduces to standard mini-batch discrimination [23]. This shows how MBGNNs are connected to mini-batch discrimination. Our MC-MBGNN model is more expressive than mini-batch discrimination, and can significantly mitigate mode collapse in GANs, as demonstrated by the experiments in Section 4.5.
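A hedged sketch of such a discriminator head is shown below, with $\psi$ fixed to the absolute difference; the dimensions and the single-layer form of $\phi(\cdot)$ are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MCMBGNNHead(nn.Module):
    """Discriminator head that concatenates aggregated edge features (eq. 8) with node features."""
    def __init__(self, feat_dim, proj_dim, edge_dim, num_heads):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(feat_dim, proj_dim, bias=False) for _ in range(num_heads)])
        self.phi = nn.ModuleList([nn.Sequential(nn.Linear(proj_dim, edge_dim), nn.ReLU())
                                  for _ in range(num_heads)])
        self.final = nn.Linear(feat_dim + num_heads * edge_dim, 1)   # W_final

    def forward(self, h):                                  # h: B x feat_dim encoder features
        edge_feats = []
        for W_n, phi_n in zip(self.proj, self.phi):
            z = W_n(h)                                     # W_n h_i
            diff = (z.unsqueeze(1) - z.unsqueeze(0)).abs() # psi: absolute difference, B x B x proj_dim
            edge_feats.append(phi_n(diff).sum(dim=1))      # sum over j as in eq. (8): B x edge_dim
        out = torch.cat([h] + edge_feats, dim=1)
        return torch.sigmoid(self.final(out))              # per-sample real/fake probability o_i
```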

4 Experiments

4.1 Datasets

We perform experiments on two standard computer vision object classification datasets, CIFAR-10 and CIFAR-100. In order to expand the scope of the experiments to include scene classification, which is more complex, we also test our models on the MIT 67 scene dataset. For our GAN experiments we use the CIFAR-10 and the CelebA datasets.
CIFAR-10 consists of 60,000 colour images in 10 classes, with 6,000 images per class. We use the standard split with 50,000 training images and 10,000 test images. Each image is 32 × 32.
CIFAR-100 is similar to CIFAR-10, except that it has 100 classes, which significantly increases the difficulty of the classification task. Each class contains 600 images, with 500 training images and 100 testing images per class. Each image is 32 × 32.
MIT 67 contains indoor scene images belonging to 67 categories, with a total of 15,620 images. The number of images varies across categories, but there are at least 100 images per category. For our experiments we reduce the image size by approximately a factor of 10 in each dimension, to 64 × 64, to reduce the computational burden. As a consequence the scene classification task becomes harder.
CelebA is a large-scale face attributes dataset containing 10,177 identities, with 202,599 colour face images in total. Each image is 64 × 64.

                              |        CIFAR 10           |        CIFAR 100          |         MIT 67
Model                         | Inductive    Transductive | Inductive    Transductive | Inductive    Transductive
Supervised vanilla            | 94.69±0.17                | 74.48±0.25                | 64.48±0.27
Supervised contrastive [15]   | 94.85±0.13                | 74.80±0.22                | 65.10±0.16
Affinity supervision [27]     | 94.45±0.35                | 74.50±0.59                | 64.60±0.30
MBGNN (concat)                | 95.19±0.23   95.21±0.21   | 75.22±0.17   75.15±0.20   | 65.80±0.21   65.81±0.19
MBGNN (sum)                   | 95.02±0.19   95.02±0.13   | 75.18±0.21   75.20±0.15   | 65.68±0.17   65.70±0.18
MBGNN (dropfeat)              | 95.24±0.19   95.22±0.25   | 75.21±0.17   75.25±0.19   | 65.82±0.18   65.84±0.20
Attn-MBGNN (concat)           | 95.14±0.14   95.12±0.21   | 75.16±0.22   75.19±0.25   | 65.86±0.15   65.87±0.18
Attn-MBGNN (sum)              | 94.95±0.18   94.98±0.15   | 75.23±0.15   75.25±0.17   | 65.72±0.19   65.73±0.17
Attn-MBGNN (dropfeat)         | 95.05±0.25   95.06±0.22   | 75.29±0.11   75.26±0.18   | 65.86±0.22   65.85±0.20
Table 1: Image classification results using a ResNet-50 encoder. The results are shown with 95% confidence intervals over 5 runs. The architectures are trained using a batch size of 256, with k = 16 for CIFAR-10 and k = 4 for CIFAR-100 and MIT67. We provide results for different combine modes of our single-layered mini-batch graph based models (rows 4-9). Additional results using a Wide ResNet-28-10 encoder are in the supplementary material.
Figure 2: We encourage the reader to zoom-in on the PDF. Plots showing training and test accuracy versus epochs, for different models on the CIFAR-10 dataset using ResNet-50 encoder. The epochs axis is on a log scale. (a) training accuracy for MBGNNs, (b) test accuracy for MBGNNs, (c) training accuracy for attention MBGNNs, (d) test accuracy for attention MBGNNs.

4.2 Image classification results

Baselines. We use a ResNet-50 model trained using cross-entropy loss as the first baseline. We also reproduce the results of [15] with a batch size of 256, and those of [27] with our ResNet-50 baseline. We focus on the relative improvement between our proposed model and these baselines.

For our proposed model, we use a ResNet-50 network without the final layer as an encoder and add an MBGNN with different combine methods. The entire model is trained end-to-end using cross-entropy loss. We provide the results for both MBGNN and Attn-MBGNN (with 95% confidence intervals over 5 runs) with all the combine options in Table 1, under both the inductive and the transductive settings. Our models perform better than all three baselines across the datasets considered. We observe no significant difference in test accuracy between the inductive and transductive settings. Among the different combine methods, the drop-feature variant performs best in general; however, the difference in performance between these variations is small. Attn-MBGNN is also slightly better, owing to the model's higher expressivity from learnable attention weights. Figure 2 compares the training and test accuracy of all the models against the standard supervised baseline, as a function of the number of epochs, on the CIFAR-10 dataset. The MBGNN models train faster than the standard network, giving a significant performance difference during the early part of training.

4.3 Ablation study: k and batch size

The most important parameters for graph-based learning are the size of the graph and the degree of each node, i.e., the neighbourhood size $k$. Here we use the ResNet-50 encoder and the CIFAR-10 dataset for all our experiments. We expect the best value of $k$ to be close to the batch size divided by the number of classes, since this is the expected number of samples in a mini-batch having the same class label. For CIFAR-10 with a batch size of 256, this value is $256/10\approx 25$. We also expect performance to improve with larger batch sizes, since this translates to larger graphs, and hence more samples per class. We restrict ourselves to inductive predictions and provide the results for the drop-feature model in Tables 2 and 3. The results match our predictions. First, for CIFAR-10 the best performance is indeed obtained when $k$ is close to 25. Second, classification accuracy improves with larger batch sizes.

Model                  | k=4   | k=8   | k=16  | k=32  | k=64  | k=128 | k=256
MBGNN (dropfeat)       | 94.91 | 95.10 | 95.24 | 95.22 | 95.02 | 94.11 | 92.88
Attn-MBGNN (dropfeat)  | 94.79 | 94.88 | 95.05 | 94.97 | 94.90 | 94.43 | 93.52
Table 2: Image classification results on CIFAR-10 using a ResNet-50 encoder and an MBGNN, while varying the neighbourhood size k. All the networks are trained with a batch size of 256, using a single layer GNN.
Model                  | BS=32 | BS=64 | BS=128 | BS=256 | BS=512
MBGNN (dropfeat)       | 94.34 | 94.88 | 95.10  | 95.24  | 95.28
Attn-MBGNN (dropfeat)  | 94.38 | 94.82 | 94.95  | 95.05  | 95.12
Table 3: Image classification results on CIFAR-10 using a ResNet-50 encoder and an MBGNN, with different batch sizes. All the networks are trained with a neighbourhood size of k = 16, using a single layer GNN.
Figure 3: We encourage the reader to zoom-in on the PDF. Average test accuracy at different corruption severities for pixel-wise Gaussian noise (top) and Gaussian blurring (bottom) on CIFAR10, using a ResNet-50 encoder. Models using MBGNNs (purple) maintain higher accuracy over the range of corruption severities when compared to the baseline model (blue) and have a lower reduction in accuracy at higher corruption levels. (a) Sample images with increasing levels of Gaussian noise, (b) and (c) test accuracy plots for MBGNN (sum) and Attn-MBGNN (sum), (d) Sample images with increasing levels of Gaussian blurring, (e) and (f) test accuracy plots for MBGNN (sum) and Attn-MBGNN (sum). The plots for other models are provided in the supplementary material.

4.4 Robustness experiments

                        |              SimBA                              |            Bandits-TD
Model                   | Mean queries | Median queries | Attack success  | Mean queries | Median queries | Attack success
Baseline ResNet-50      | 357.31       | 302            | 100.00%         | 564.22       | 524            | 87.93%
MBGNN, concat           | 508.53       | 388            | 99.79%          | 768.86       | 620            | 87.77%
MBGNN, sum              | 520.46       | 394            | 99.89%          | 708.55       | 638            | 87.74%
MBGNN, dropfeat         | 572.15       | 387            | 99.68%          | 787.69       | 659            | 87.66%
Attn-MBGNN, concat      | 415.11       | 342            | 99.89%          | 613.15       | 574            | 87.92%
Attn-MBGNN, sum         | 358.88       | 319            | 100.00%         | 558.31       | 542            | 87.94%
Attn-MBGNN, dropfeat    | 392.19       | 338            | 99.79%          | 590.17       | 580            | 87.84%
Table 4: Black-box adversarial attack results for the baseline and the proposed models using SimBA [9] and Bandits-TD [12] on the CIFAR10 dataset. A higher number of queries is better, and a lower attack success rate is better. Note that the mean and median number of queries are calculated over successful attacks only.
Figure 4: Histograms of the number of queries required until a successful attack (over 1000 target images) on the CIFAR10 dataset using dropfeat MBGNN and Attn-MBGNN for SimBA (top) and Bandits-TD (bottom). The queries axis is limited to 3000 queries for clarity of presentation. Models using MBGNNs (red) require more queries on average for a successful attack as compared to the baseline model (blue). The plots for other model variations are provided in the supplementary material.

In order to test the robustness of MBGNNs to image perturbations, we first consider random (pixel-wise) Gaussian noise and local Gaussian blurring on input images, with varying levels of corruption severity. The evaluation of test accuracy is done via transductive testing, where a mini-batch consists of a single corrupted image along with a training set of uncorrupted images. The class label prediction made by the model for the corrupted image is compared against the true label. Figure 3 shows plots of test accuracy, measured in the manner described above, for the best performing variations of MBGNN (sum) and Attn-MBGNN (sum) on the CIFAR-10 dataset for different levels of corruption severity. We observe that models using MBGNNs are far better at accommodating the effects of local perturbations to the images (Fig. 3 first row) and that although the effects of local Gaussian blur are less harmful, models based on MBGNNs are still better by about 2% over the range of corruption severities we have considered (Fig. 3 second row).

We also test the robustness of the model to two recently proposed and popular black-box adversarial attack methods, SimBA [9] and Bandits-TD [12]. Table 4 shows the mean and median number of queries before a successful attack, and the attack success rate, for the different MBGNN models under the two attack methods. The MBGNN models have slightly lower attack success rates and higher mean/median queries before a successful attack when compared to the baseline ResNet model. The large increase in mean queries can be attributed to a heavier tail in the distribution of queries, as can be seen from the histogram plots in Figure 4, which show results for the best performing variations, MBGNN (dropfeat) and Attn-MBGNN (dropfeat). Plots for other variations are provided in the supplementary material. The Attention MBGNN models outperform the baseline model but do not perform as well as the MBGNN models. One reason for this might be that any perturbation in the input samples has a compounding effect on the calculation of the attention weights and, therefore, on the aggregation.

4.5 GAN training using MC-MBGNN

(a) CelebA  (b) CIFAR-10
Figure 5: Plots of NDB scores for 100 bins over training steps, where a step refers to 500 training iterations. Lower NDB scores imply higher sample diversity. We report confidence intervals of the NDB values over 5 runs. We take 160 batches of real training images and 40 batches of fake generated images with a batch size of 128 to estimate the true statistics of the dataset and reduce the NDB test time.

Lastly, we provide results for training GANs using MC-MBGNN, and compare this strategy with both a vanilla GAN and mini-batch discrimination [23]. We test the diversity of our generated samples using the Number of statistically Different Bins (NDB), which was proposed in [21] as a metric to quantify mode collapse in GANs. To compute this metric, we first cluster the training dataset into $K$ different bins and then allocate the generated images to these bins based on their proximity to the centroid of each bin. We then measure the statistical similarity between the real and fake images in each of the bins, and compute the fraction of statistically different bins, which gives the NDB score. In the case of mode collapse, the number of statistically different bins is close to $K$, and the NDB score is close to 1. Using the NDB score as the metric, in Figure 5 we show that our proposed MC-MBGNN helps the generator learn faster and generate more diverse samples, which is substantiated by lower NDB scores. We provide details of the architecture, the experimental setup, and the generated images in the supplementary material.
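For reference, a rough sketch of how an NDB-style score can be computed is given below; it assumes real and fake are matrices of flattened image vectors and uses a two-proportion z-test per bin, following the description above rather than the reference implementation of [21].

```python
import numpy as np
from scipy.stats import norm
from sklearn.cluster import KMeans

def ndb_score(real, fake, num_bins=100, alpha=0.05):
    """Fraction of bins whose real/fake proportions differ significantly (NDB-style score)."""
    km = KMeans(n_clusters=num_bins, n_init=10).fit(real)          # bins from the real data
    p_real = np.bincount(km.labels_, minlength=num_bins) / len(real)
    p_fake = np.bincount(km.predict(fake), minlength=num_bins) / len(fake)
    # Two-proportion z-test per bin.
    p_pool = (p_real * len(real) + p_fake * len(fake)) / (len(real) + len(fake))
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / len(real) + 1 / len(fake)))
    z = np.abs(p_real - p_fake) / np.maximum(se, 1e-12)
    return float(np.mean(z > norm.ppf(1 - alpha / 2)))
```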

5 Discussion

The use of an MBGNN shows promise for computer vision tasks, and in particular, image classification. It improves raw performance and adds robustness to adversarial attacks and image perturbations, with low computational overhead. There are certain practical limitations of our method in its present form for training on large-scale datasets, which could be addressed in future work. First, from an implementation standpoint, parallelization requires further development. In our experiments, each mini-batch was handled by the same GPU. Second, the mini-batch size also needs to be at least a few times larger than the number of classes to ensure that multiple samples belong to the same class within a mini-batch during training. MBGNNs can be used with the most popular network models and require a modification of only their last layer. As such, we anticipate that they could find a use for diverse computer vision tasks beyond classification, such as segmentation, region proposal detection, image captioning, and relationship modeling.

References

  • [1] Eirikur Agustsson, Michael Tschannen, Fabian Mentzer, Radu Timofte, and Luc Van Gool. Generative adversarial networks for extreme learned image compression. In Proceedings of the IEEE International Conference on Computer Vision, pages 221–231, 2019.
  • [2] Rodrigo Benenson, Stefan Popov, and Vittorio Ferrari. Large-scale interactive object segmentation with human annotators. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [3] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
  • [4] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
  • [5] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pages 3844–3852, 2016.
  • [6] Liyu Gong and Qiang Cheng. Exploiting edge features for graph neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9211–9219, 2019.
  • [7] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [8] Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., volume 2, pages 729–734. IEEE, 2005.
  • [9] Chuan Guo, Jacob R. Gardner, Yurong You, A. Wilson, and Kilian Q. Weinberger. Simple black-box adversarial attacks. In ICML, 2019.
  • [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [11] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.
  • [12] Andrew Ilyas, Logan Engstrom, and Aleksander Madry. Prior convictions: Black-box adversarial attacks with bandits and priors. In International Conference on Learning Representations, 2019.
  • [13] Or Isaacs, Oran Shayer, and Michael Lindenbaum. Enhancing generic segmentation with learned region representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [14] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  • [15] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. arXiv preprint arXiv:2004.11362, 2020.
  • [16] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • [17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, page 1097–1105, Red Hook, NY, USA, 2012. Curran Associates Inc.
  • [18] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017.
  • [19] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. volume 48 of Proceedings of Machine Learning Research, pages 2014–2023, New York, New York, USA, 20–22 Jun 2016. PMLR.
  • [20] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [21] Eitan Richardson and Yair Weiss. On gans and gmms. In Advances in Neural Information Processing Systems, pages 5847–5858, 2018.
  • [22] Pau Rodríguez, Issam Laradji, Alexandre Drouin, and Alexandre Lacoste. Embedding propagation: Smoother manifold for few-shot classification. arXiv preprint arXiv:2003.04151, 2020.
  • [23] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. CoRR, abs/1606.03498, 2016.
  • [24] Victor Garcia Satorras and Joan Bruna Estrach. Few-shot learning with graph neural networks. In International Conference on Learning Representations, 2018.
  • [25] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
  • [26] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
  • [27] Chu Wang, Babak Samari, Vladimir G Kim, Siddhartha Chaudhuri, and Kaleem Siddiqi. Affinity graph supervision for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8247–8255, 2020.
  • [28] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8798–8807, 2018.
  • [29] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog), 38(5):1–12, 2019.
  • [30] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. arXiv preprint arXiv:1806.03536, 2018.
  • [31] Ji Zhang, Mohamed Elhoseiny, Scott Cohen, Walter Chang, and Ahmed Elgammal. Relationship proposal networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5678–5686, 2017.
  • [32] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2018.
  • [33] Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. Generalizing graph neural networks beyond homophily. arXiv preprint arXiv:2006.11468, 2020.
  • [34] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.
  • [35] Chengxu Zhuang, Xuehao Ding, Divyanshu Murli, and Daniel Yamins. Local label propagation for large-scale semi-supervised learning, 2019.

6 Supplementary Material

6.1 Results for Wide ResNet

Table 5 provides results for our MBGNN with all the combine options, using Wide ResNet-28-10 as the encoder network. We also use this network for all the baselines in this table. We observe a consistent improvement across all datasets, in line with the results of our experiments using ResNet-50 in the main paper.

                              |        CIFAR 10           |        CIFAR 100          |         MIT 67
Model                         | Inductive    Transductive | Inductive    Transductive | Inductive    Transductive
Supervised vanilla            | 95.62±0.14                | 79.58±0.20                | 66.20±0.19
Supervised contrastive [15]   | 95.91±0.16                | 80.15±0.15                | 66.89±0.17
Affinity supervision [27]     | 95.59±0.18                | 79.8±0.19                 | 66.8±0.16
MBGNN (concat)                | 96.02±0.21   96.05±0.20   | 80.18±0.18   80.20±0.18   | 66.87±0.22   66.90±0.21
MBGNN (sum)                   | 95.78±0.15   95.80±0.14   | 79.85±0.17   79.88±0.15   | 66.52±0.18   66.51±0.17
MBGNN (dropfeat)              | 96.14±0.16   96.17±0.18   | 80.46±0.21   80.45±0.20   | 67.10±0.19   67.12±0.20
Attn-MBGNN (concat)           | 95.95±0.15   95.95±0.18   | 80.22±0.20   80.23±0.19   | 66.96±0.18   66.98±0.18
Attn-MBGNN (sum)              | 95.89±0.19   95.91±0.18   | 80.02±0.20   80.06±0.18   | 66.67±0.16   66.69±0.17
Attn-MBGNN (dropfeat)         | 96.12±0.20   96.14±0.21   | 80.76±0.17   80.76±0.15   | 67.25±0.22   67.26±0.20
Table 5: Image classification results using a Wide ResNet-28-10 encoder. The architectures are trained using a batch size of 256, with k = 16 for CIFAR-10 and k = 4 for CIFAR-100 and MIT67. We provide results for different combine modes of our single-layered mini-batch graph based models (rows 4-9).

6.2 Additional robustness results

6.2.1 Random Perturbations

We provide results for the Gaussian noise and Gaussian blurring perturbations for all the different variations of the MBGNN models, using the ResNet-50 encoder on the CIFAR-10 dataset. Figures 6 and 8 show sample CIFAR-10 images with different levels of corruption severity, for visualization. For both Gaussian noise and Gaussian blurring, we define the corruption severity as the standard deviation σ used when sampling from the Gaussian distribution, with higher values of σ corresponding to increased corruption.

Figure 7 shows the average test accuracy for different levels of Gaussian noise and Figure 9 shows the average test accuracy for different levels of Gaussian blurring. In both figures, the top row shows plots for MBGNN models (concat, sum and dropfeat) and the bottom row shows plots for Attn-MBGNN models (concat, sum and dropfeat). Models using MBGNNs (purple) maintain higher accuracy over the entire range of corruption severities as compared to the baseline model (blue) and show a lower drop in accuracy for higher corruption levels. We also observe that the sum combination method generally performs better than the other variations.

6.2.2 Black-box Adversarial Attacks

We provide histogram plots of the number of queries required until a successful attack (over 1000 images) for all the variations of the MBGNN model. Figures 10 and 11 show the plots for SimBA [9] and Bandits-TD [12], respectively. The dashed lines indicate the median value and the dotted lines indicate the mean value for the different models. In both figures, the top row shows plots for MBGNN models (concat, sum and dropfeat) and the bottom row shows plots for Attn-MBGNN models (concat, sum and dropfeat). The performance across the different combination methods is similar, although the dropfeat models generally have a higher number of mean and median queries, due to the heavier tail in the distribution. Also, the MBGNN models with the attention mechanism have lower mean and median queries than the models without attention.

6.3 Details of the GAN architecture

We use standard generator and discriminator architectures for our GAN model. Let us denote the following operations:
Basic convolution: Conv(in_channels, out_channels, filter_size, stride, padding),
Linear layer: Linear(in_dim, out_dim),
Deconvolution: Conv_trans(in_channels, out_channels, filter_size, stride, padding, output_padding).
The generator architecture, using the above notation, is given by
Linear(128, 16384) - batch_norm - ReLU
Conv_trans(1024, 512, 5, 2, 2, 1) - batch_norm - ReLU
Conv_trans(512, 256, 5, 2, 2, 1) - batch_norm - ReLU
Conv_trans(256, 128, 5, 2, 2, 1) - batch_norm - ReLU
Conv_trans(128, 64, 5, 2, 2, 1) - batch_norm - ReLU
Conv_trans(64, 3, 5, 1, 2, 0) - tanh
The discriminator architecture is given by
Conv(3, 64, 4, 2, 1) - batch_norm - LeakyReLU
Conv(64, 128, 4, 2, 1) - batch_norm - LeakyReLU
Conv(128, 256, 4, 2, 1) - batch_norm - LeakyReLU
Conv(256, 512, 4, 2, 1) - batch_norm - LeakyReLU
Linear(8192, 1) - sigmoid
This defines the GAN architecture for the CelebA dataset. For CIFAR-10, we slightly modify the Linear(·) layers to adjust to the reduced image sizes. We add both the mini-batch discrimination and MC-MBGNN layers before the final Linear(·) layer of the discriminator. For our experiments, we use the Adam optimizer with a learning rate of 10^{-4} for the discriminator and 2 × 10^{-4} for the generator, with a batch size of 128. We provide the generated images obtained using the different discriminator architectures in Figure 12 and Figure 13. We train the model on CIFAR-10 for 10,000 iterations and on CelebA for 7,500 iterations.
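For convenience, the discriminator listing above translates roughly into the following PyTorch module; the LeakyReLU slope of 0.2 is an assumption (it is not specified above), and the MC-MBGNN or mini-batch discrimination layer would be inserted just before the final linear layer.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """DCGAN-style discriminator for 64 x 64 images, following the layer listing above."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.BatchNorm2d(64), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 512, 4, stride=2, padding=1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2),
        )
        self.final = nn.Linear(512 * 4 * 4, 1)       # 8192 -> 1 for 64 x 64 inputs

    def forward(self, x):
        h = self.features(x).flatten(1)
        # An MC-MBGNN or mini-batch discrimination layer would act on h here.
        return torch.sigmoid(self.final(h))
```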

Figure 6: Sample images from the CIFAR-10 dataset with each column showing an increasing level of corruption severity, for pixelwise Gaussian noise.
Figure 7: Average test accuracy at different corruption severities for Gaussian noise on CIFAR10, using ResNet-50 with MBGNNs (top) and ResNet-50 with Attention MBGNNs (bottom). Models using MBGNNs (purple) maintain higher accuracy over the entire range of corruption severities as compared to the baseline model (blue), and show a lower drop in accuracy for higher corruption levels.
Figure 8: Sample images from the CIFAR-10 dataset, with each column showing an increasing level of corruption severity for Gaussian blurring.
Figure 9: Average test accuracy at different corruption severities for Gaussian blurring on CIFAR10, using ResNet-50 with MBGNNs (top) and ResNet-50 with Attention MBGNNs (bottom). Models using MBGNNs (purple) maintain higher accuracy over the range of corruption severities as compared to the baseline model (blue), and show a lower drop in accuracy at higher corruption levels.
Figure 10: Histogram of the number of queries required until a successful attack (over 1000 target images) using SimBA on the CIFAR10 dataset. The queries axis is limited to 3000 queries for clarity of presentation. Models using MBGNNs (red) require a larger number of queries on average for a successful attack, as compared to the baseline model (blue). (a) MBGNN, concat, (b) MBGNN, sum, (c) MBGNN, dropfeat, (d) Attn-MBGNN, concat, (e) Attn-MBGNN, sum, (f) Attn-MBGNN, dropfeat.
Figure 11: Histogram of number of queries required until a successful attack (over 1000 target images) using Bandits-TD on the CIFAR10 dataset. The queries axis is limited to 3000 queries for clarity of presentation. Models using MBGNNs (red) require a larger number of queries on average for a successful attack, as compared to the baseline model (blue). (a) MBGNN, concat, (b) MBGNN, sum, (c) MBGNN, dropfeat, (d) Attn-MBGNN, concat, (e) Attn-MBGNN, sum, (f) Attn-MBGNN, dropfeat.
(a) Minibatch Discrimination  (b) Proposed MC-MBGNN
Figure 12: Samples generated by the Generator for the CelebA dataset.
(a) Minibatch Discrimination  (b) Proposed MC-MBGNN
Figure 13: Samples generated by the Generator for the CIFAR-10 dataset.