CNT (Conditioning on Noisy Target): A new Algorithm for Leveraging Top-Down Feedback

Alexia Jolicoeur-Martineau Alex Lamb Vikas Verma Aniket Didolkar

(December 2021)

CNT (Conditioning on Noisy Targets): A new Algorithm for Leveraging Top-Down Feedback

Alexia Jolicoeur-Martineau Alex Lamb Vikas Verma Aniket Didolkar

(December 2021)

Abstract

We propose a novel regularizer for supervised learning called Conditioning on Noisy Targets (CNT). This approach consists in conditioning the model on a noisy version of the target(s) (e.g., actions in imitation learning or labels in classification) at a random noise level (from small to large noise). At inference time, since we do not know the target, we run the network with only noise in place of the noisy target. CNT provides hints through the noisy label (with less noise, we can more easily infer the true target). This give two main benefits: 1) the top-down feedback allows the model to focus on simpler and more digestible sub-problems and 2) rather than learning to solve the task from scratch, the model will first learn to master easy examples (with less noise), while slowly progressing toward harder examples (with more noise).

regularization, progressive learning, classification, imitation learning

1 Introduction

Refer to caption — Figure 1: This figure illustrates CNT in a “cat” versus “dog” classification problem. During training, we help the model by giving it a hint which corresponds to the noisy target. When the noise is small, the model can "cheat" and tell apart the true target. But as the noise increases, the model is given less hint and must rely more and more on the input image. During inference, the target is unknown, so the model is only given noise (no hint).

Astonishingly, deep learning can solve complex high-dimensional problems from a randomly initialized neural network without any prior experience/knowledge. However, the path from a random network to one that can effectively solve a problem is long and arduous. Rather than learning to solve a complex problem from scratch, humans tend to learn progressively: by starting from easy tasks and slowly increasing their difficulty.

Progressive learning methods have also been used in deep learning due to their benefits in improving model generalization (Fayek et al., 2018). Curriculum learning consists in ordering the samples of the dataset in order of difficulty to help learn progressively (Bengio et al., 2009). Gradual transfer learning consists in moving from easier to harder datasets (Fayek et al., 2018). Progressive growing consists in growing the neural network over time to replicate neural growth (Terekhov et al., 2015; Rusu et al., 2016) or to solve easier tasks before solving harder tasks (Karras et al., 2017).

Although progressive learning methods can improve generalization, these methods are often non-trivial to incorporate. Progressive growing requires a complex architecture design and extensive hyperparameter tuning. Curriculum learning requires access to a model that estimates the difficulty of each sample so that we can train a network on samples of increasing difficulty. Gradual transfer learning requires multiple datasets sorted by difficulty and extensive architecture/hyper-parameter tuning.

In this work, we seek to help the model progress from easier to harder tasks by providing it hints during training. Ideally, we would like such a method to be easy to incorporate within any given problem and not require hyper-parameter tuning. Given the strong evidence that progressive learning methods can improve generalization (Bengio et al., 2009; Fayek, 2018), we believe that providing hints to the network would improve generalization in supervised learning problems and be easier to incorporate than most existing progressive learning methods.

This raises the question: How can one provide hints about the solution to a neural network? We propose to give hints to the neural network by conditioning the neural network on a noisy version of the target(s) (e.g., actions in imitation learning or labels in classification). When adding very little noise to the target, the true target can be easily deduced by the neural network thus making the problem easier. Meanwhile, when adding a large amount of noise to the target, the true target is harder to infer from the noisy target. This provides a path for the neural network to learn to solve easier problems (noisy target with less noise) before harder problems (noisy target with more noise).

At the same time, we may consider why a noisy-version of the target is appropriate as a “hint” for training. An intuitive reason is that the target contains high-level and semantically meaningful top-down information, which is relevant for perception. The role of top-down feedback in perception has been studied in the cognitive science literature (Rauss and Pourtois, 2013; Kinchla and Wolfe, 1979) as a way of improving robustness to challenging or unreliable bottom-up signals. For example, a person may be able to navigate in a dark room that they are familiar with by leveraging priors about where objects are expected to be present. In our case, a model may be able to more easily make predictions early in training when it has access to noisy-targets, which indicate a handful of target values which are the most likely.

Adding noise to the target and conditioning the model on this noisy target should be straightforward. However, progressively increasing the noise according to a schedule over time is complicated; doing so would require heavy hyper-parameter tuning and move us away from our goal of having a simple tuning-free method.

Instead of a progressive approach, we take a multi-scale approach. An example of a multi-scale approach would be the Multi-Scale Gradient (MSG) approach by Karnewar and Iyengar (2019). Instead of progressively growing an architecture to generate images from low-resolution to high-resolution (Karras et al., 2017), MSG proposed to generate images from all resolutions simultaneously. MSG is now used in StyleGAN2 (Karras et al., 2020) instead of progressive growing because it enhances the stability and quality of the generated samples. The main benefit of a multi-scale approach is that the model is given gradients at all levels of difficulty, which means that it cannot unlearn solving the simpler tasks (generating lower-resolution images with MSG) after learning to solve the harder tasks (generating higher-resolution images with MSG).

Given the benefit and simplicity of the multi-scale approach, we use it instead of progressively increasing the noise. First, we sample uniformly from a continuous noise level ( $t\sim\mathcal{U}([0,1])$ ) where $t=0$ means no noise, and $t=1$ means only noise. Then, we inject an embedding on the noise level and the noisy label (hint) into the network. At test time, instead of sampling from the uniform distribution, we take $t=1$ , which means we have no information about the label and must start without any hint.

We call this technique Conditioning on Noisy Targets (CNT). CNT helps the network learn from multiple difficulties simultaneously and provides a continuous path for the model to progress from maximum hints (no noise) to no hints (only noise). We illustrate our approach in Figure 1 and highlights how it learn progressively in a multi-scale fashion in Figure 2. Our contribution is the Conditioning on Noisy Target (CNT) approach, a multi-scale task-agnostic regularizer providing top-down feedback to the neural network to improve the efficiency of learning. This approach is easy to implement as it only requires adding a small embedding network (for the noisy output and noise-level) and injecting the embedding in conditional normalization layers, which replace the usual normalization layers; no other changes are needed. Our approach can be applied to any neural network on any supervised learning problem and requires no hyper-parameter tuning. We test the validity of the proposed method on various different tasks like using image labels as a target for image classification problems, using the information about actions as a target for reinforcement learning problems.

2 From Supervised learning to CNT

We want to solve a supervised learning problem:

\min_{\theta}\mathbb{E}_{p(\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{y}})}\left[L(f_{\theta}(\boldsymbol{\mathrm{x}}),\boldsymbol{\mathrm{y}})\right],

(1)

where $L$ is an arbitrary loss function, $f_{\theta}$ is a neural network with parameters $\theta$ , $(\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{y}})$ is a pair of data and target sampled from a dataset.

For generality, we consider $\boldsymbol{\mathrm{y}}\in\mathbb{R}^{CL}$ to be a real-valued vector where $C$ is the number of classes and $L$ is the number of labels. In the case of multi-class classification, we use the one-hot vector representation as $\boldsymbol{\mathrm{y}}$ . In the multi-label case, we concatenate the one-hot vectors for all of the labels together to construct $\boldsymbol{\mathrm{y}}$ .

With most loss functions used in supervised learning (e.g., least-squares and cross-entropy), the optimal model returns the conditional expectation of $\boldsymbol{\mathrm{y}}$ given $\boldsymbol{\mathrm{x}}$ :

f_{\theta}^{*}(\boldsymbol{\mathrm{x}})=\mathbb{E}_{p(\boldsymbol{\mathrm{y}}|\boldsymbol{\mathrm{x}})}\left[\boldsymbol{\mathrm{y}}|\boldsymbol{\mathrm{x}}\right].

(2)

2.1 Noisy target

Instead of having $\boldsymbol{\mathrm{y}}$ be the target, we now want to replace it with a noisy target $\{\boldsymbol{\mathrm{y}}(t)\}_{t=0}^{1}$ , so that $\boldsymbol{\mathrm{y}}(0)$ is the true target and $\boldsymbol{\mathrm{y}}(1)$ is pure noise. Let $(\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{y}}(0))$ be a pair of data/target sampled from the data distribution. Any type of time-dependent noise could be used. Following the recent success of score-based diffusion models (Ho et al., 2020; Song et al., 2020), we chose to gradually corrupt the target over time using a diffusion process:

\mathrm{d}\boldsymbol{\mathrm{y}}=f(\boldsymbol{\mathrm{y}},t)\mathrm{d}t+g(t)\mathrm{d}\boldsymbol{\mathrm{w}},

(3)

where $f(\boldsymbol{\mathrm{y}},t):\mathbb{R}^{d}\times\mathbb{R}\to\mathbb{R}^{d}$ is the drift, $g(t):\mathbb{R}\to\mathbb{R}$ is the diffusion coefficient and $\boldsymbol{\mathrm{w}}(t)$ is the Wiener process indexed by $t\in[0,1]$ .

The functions $f$ and $g$ are chosen so that $\boldsymbol{\mathrm{y}}(0)$ is the true label ( $f(\boldsymbol{\mathrm{y}},0))=g(\boldsymbol{\mathrm{y}},0)=0$ ) and $\boldsymbol{\mathrm{y}}(1)$ is independent from $\boldsymbol{\mathrm{y}}(0)$ so that $\boldsymbol{\mathrm{y}}(1)$ can be sampled from even when $\boldsymbol{\mathrm{y}}(0)$ is unknown (at inference time). Given the good empirical results from the Variance Preserving (VP) process in score-based diffusion models (Song et al., 2020), we use it as our diffusion process; it is defined as follows:

\mathrm{d}\boldsymbol{\mathrm{y}}=-\frac{1}{2}\beta(t)\boldsymbol{\mathrm{y}}\mathrm{d}t+\sqrt{\beta(t)}\mathrm{d}\boldsymbol{\mathrm{w}}.

(4)

We can create noisy targets from the VP process by sampling from the following distribution:

\boldsymbol{\mathrm{y}}(t)|\boldsymbol{\mathrm{y}}(0)\sim\mathcal{N}(\boldsymbol{\mathrm{y}}(0)~{}e^{-\frac{1}{2}\int_{0}^{t}\beta(s)\mathrm{d}s},(1-e^{-\int_{0}^{t}\beta(s)\mathrm{d}s})~{}\boldsymbol{\mathrm{I}}),

(5)

where $\beta(t)=\beta_{min}+t\left(\beta_{max}-\beta_{min}\right)$ , $\beta_{min}=0.2$ and $\beta_{max}=20$ .

Thus, as desired, $\boldsymbol{\mathrm{y}}(0)$ is approximately the target and $\boldsymbol{\mathrm{y}}(1)$ is approximately distributed as $\mathcal{N}(\boldsymbol{0},\boldsymbol{\mathrm{I}})$ and does not depend on $\boldsymbol{\mathrm{y}}(0)$ .

Instead of using a diffusion process, it is also possible to use a different type of noise. For example, one could use Laplace distributed noise instead of Gaussian distributed noise. The Laplace distribution has heavier tails while having a higher concentration at its mean. The high concentration at the mean may make it easier to know how close the hint is to the truth. To add Laplace noise, we follow equation 5 but replace the Gaussian distribution by the Laplace distribution:

$\boldsymbol{\mathrm{y}}(t)|\boldsymbol{\mathrm{y}}(0)\sim Laplace\left(\boldsymbol{\mathrm{y}}(0)~{}e^{-\frac{1}{2}\int_{0}^{t}\beta(s)\mathrm{d}s},\sqrt{1-e^{-\int_{0}^{t}\beta(s)\mathrm{d}s}}\right)$ .

(6)

Our paper focuses on the Gaussian VP Process, but we also study its Laplace equivalent in a few settings.

2.2 CNT (Conditioning on Noisy Target)

Rather than solving the supervised learning problem using only $\boldsymbol{\mathrm{x}}$ , we now want to further condition on the noisy target:

\min_{\theta}\mathbb{E}_{p(\boldsymbol{\mathrm{y}}(t)|\boldsymbol{\mathrm{y}}(0)),p(\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{y}}(0))}\left[L(f_{\theta}(\boldsymbol{\mathrm{x}}|\boldsymbol{\mathrm{y}}(t),t),\boldsymbol{\mathrm{y}})\right],

(7)

where $f_{\theta}$ is now a neural network taking $\boldsymbol{\mathrm{x}}$ , $\boldsymbol{\mathrm{y}}(t)$ , and $t$ as inputs.

At the optimum, the neural network is now the conditional expectation of $\boldsymbol{\mathrm{y}}$ given $\boldsymbol{\mathrm{x}}$ , $\boldsymbol{\mathrm{y}}(t)$ , and $t$ :

f_{\theta}^{*}(\boldsymbol{\mathrm{x}}|\boldsymbol{\mathrm{y}}(t),t)=\mathbb{E}_{p(\boldsymbol{\mathrm{y}})}\left[\boldsymbol{\mathrm{y}}|\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{y}}(t),t\right].

(8)

At $t=0$ , the problem is already solved and the the optimal network returns back the target:

f_{\theta}^{*}(\boldsymbol{\mathrm{x}}|\boldsymbol{\mathrm{y}}(0),t=0)=\boldsymbol{\mathrm{y}}(0).

(9)

At $t=1$ ,the optimal network ignores the noisy label since it independent of the label:

f_{\theta}^{*}(\boldsymbol{\mathrm{x}}|\boldsymbol{\mathrm{y}}(1),t=1)=\mathbb{E}_{p(\boldsymbol{\mathrm{y}})}\left[\boldsymbol{\mathrm{y}}|\boldsymbol{\mathrm{x}}\right].

(10)

At inference, since we do not know the true label, we use $t=1$ to recover our original expected value of $\boldsymbol{\mathrm{y}}$ given $\boldsymbol{\mathrm{x}}$ .

Since we sample from a random noisy $\boldsymbol{\mathrm{y}}(1)$ , it might appear sensible to average over multiple random draws of $\boldsymbol{\mathrm{y}}(1)$ at inference; however, as shown above, the noise is completely independent of the true label at $t=1$ , and the model should ignore the noise by converging to the expected value of $\boldsymbol{\mathrm{y}}$ given $\boldsymbol{\mathrm{x}}$ . We found no benefit from using multiple random draws; it only made inference slower. We illustrate CNT with the VP process in Figure 1. During training, we condition on the $\boldsymbol{\mathrm{y}}(t)$ with a random noise-level $t$ as shown in Equation 7. At inference time, we do not know the true label, so we only provide pure noise $\boldsymbol{\mathrm{y}}(1)\sim\mathcal{N}(0,I)$ and the output should return the expected value of $\boldsymbol{\mathrm{y}}$ given $\boldsymbol{\mathrm{x}}$ .

2.3 Implementation

There are many possible ways of conditioning on the noisy target $\boldsymbol{\mathrm{y}}(t)$ . A very popular way of input conditioning is through conditional normalization layers (Huang and Belongie, 2017; De Vries et al., 2017; Karras et al., 2019; Dhariwal and Nichol, 2021); this is the approach we use.

Our approached for conditioning on the noisy target is explained in Figure 3 and below. We start by processing $t$ through a Fourier positional embedding (Tancik et al., 2020) in order to increase the high-frequency signal. Then, we separately process both $\boldsymbol{\mathrm{y}}(t)$ and the Fourier processed $t$ through small fully-connected networks. Then, we concatenate both outputs to produce an overall embedding for $(t,\boldsymbol{\mathrm{y}}(t))$ . After each normalization layer, instead of using learned fixed weights and bias, we learn noise-conditional weights and bias, which are linear projections of the embedding.

The model conditional on the noisy-target has more parameters than a unconditional model due to the embedding parameters and the weights of the linear projections. However, given a fixed $(t,\boldsymbol{\mathrm{y}}(t))$ , the network is equivalent to an unconditional neural network. Furthermore, the model should learn to ignore $\boldsymbol{\mathrm{y}}(t)$ when $t=1$ given that the noisy target contains zero information about the target at that level of noise. This relatively simple process can be adapted to any network, even those without normalization layers, by simply applying the conditional weights and bias at different locations through the network.

3 Related Work

Several lines of research explore topics related to the CNT technique.

Leveraging Top-Down Feedback.

One can learn representations in a bottom-up or top-down manner. Bottom-up refers to the scenario when the low-level (sensory) representations modulate high level (conceptual) representations. Top-Down refers to the scenario when during information processing, information from high-level representations modulate lower-level representations. Exploring the combination of top-down and bottom-up signals in network architectures has an important history in deep learning (Theeuwes, 2010; Perez et al., 2018; Anderson et al., 2018; Carreira et al., 2016; Mittal et al., 2020; Lamb et al., 2021). The goal has been to explore the value of combining top-down mechanisms for efficient learning, robustness to noise or distractors and hence achieving efficient transfer across related tasks by exploiting the task structure. The proposed regularizer is also relevant to models which incorporate feature wise modulation based on some conditioning information. The conditioning information is used to influence the computation done by a neural network. For example. conditioning information can take the form of “language instruction” to modulate the processing of a neural network learning a representation of an visual image (Perez et al., 2018). It can also take the form of goal information that is used to condition the policy as in goal based policies in reinforcement learning (Schaul et al., 2015). In the proposed framework, we use the noisy version of the “target” as the conditioning information, and we anneal the noise level to force the network to iteratively learn representations by focussing on simpler patterns in the data.

Curriculum and Progressive Learning

Training deep networks on examples which progress on a curriculum from easier to more difficult examples has been a successful strategy, with benefits for both ease of optimization and generalization (Bengio et al., 2009). (Soviany et al., 2021) identifies dozens of successful applications of curriculum learning across several domains, including computer vision, speech recognition, medical image analysis, and natural language processing. At the same time, successful use of a curriculum requires two challenging hyperparameters: a schedule for when to switch from easier to more difficult examples and an ordering of examples into easier and harder examples (Hacohen and Weinshall, 2019). In some sequence problems, it is relatively easy to design such a curriculum. For example, sorting a short list of numbers is easier than sorting a longer list of numbers. In our technique, we construct a continuum of easier and harder set of examples which we train jointly, with lower noise-levels providing easier examples and higher noise-levels leading to harder examples.

Injecting Noise into Deep Models

Training deep neural networks with noise has proved to be a surprisingly successful regularizer. The dropout technique randomly sets a percentage of the units in a deep network to zero on each forward pass (Srivastava et al., 2014) and has been highly successful across a variety of domains. Adding noise to a neural network’s weights has also been shown to be useful in reinforcement learning (Fortunato et al., 2017). Injecting noise into the hidden states of a neural network using a variational bottleneck has also been shown to be a successful regularizer (Alemi et al., 2016). CNT shares the idea of injecting noise into the hidden states of a deep network (through conditional normalization layers), and we found that this by itself often improved results over the baseline. However, an important difference with CNT is that conditioning is performed on a noisy target with a variable amount of noise, and the model is also conditioned on the noise level being injected. Because the target contains information which is useful for solving the task, the network has a natural inductive bias to leverage the noisy target.

Denoising Generative Models

CNT can be seen as a model for denoising the target $\boldsymbol{\mathrm{y}}$ conditioned on the input $\boldsymbol{\mathrm{x}}$ . Denoising (typically done over multiple steps) has frequently been used to construct generative models $p(\boldsymbol{\mathrm{x}})$ or conditional generative models $p(\boldsymbol{\mathrm{x}}|\boldsymbol{\mathrm{y}})$ . (Bengio et al., 2013) showed that denoising autoencoders can be used as generative models by running multiple steps of denoising during the sampling process. Work on deep diffusion, which involves learning to denoise from multiple levels of noise during both training and sampling has led to competitive conditional and unconditional generative models (Ho et al., 2020, 2021). Our work differs from these other works in that we focus exclusively on the discriminative task of modeling $p(\boldsymbol{\mathrm{y}}|\boldsymbol{\mathrm{x}})$ . Moreover this distribution is usually nearly uni-modal (or has only a few modes), which means that the denoising process can be completed perfectly in only a single step. This removes the need to run multiple steps during sampling and also removes the need to draw multiple samples.

4 Experiments and Results

To show the generalization benefits of CNT in supervised learning, we test CNT over a wide range of problems: image classification, relational reasoning (Sort-of-CLEVR), equilateral vs. non-equilateral shape classification, and reinforcement learning.

To ensure that the benefit we see in CNT comes from the label information in the noisy target and not simply from noise conditioning, we always test for noise-only conditioning: $\beta_{min}=\beta_{max}$ in the VP process so that, for any $t$ , it is pure standard Gaussian noise. It is well known that regularization through injecting noise can sometimes be beneficial; thus, it is important to compare CNT to only-noise.

Preliminary analyses showed that having the Mish activation function in the CNT embedding generally leads to slightly better results (accuracy in classification and Shapes) than using ReLU. Thus, when comparing only-noise and CNT to baseline models using ReLU, we use still use Mish inside the embedding and in the linear projections for the conditional normalization. Having a continuous embedding ensures that the embedding does not change significantly with respect to small changes in the noise-level $t$ , and we believe that this is beneficial.

4.1 Classification

We test CNT for classification with the following datasets: CIFAR-10 (Krizhevsky et al., 2009), CIFAR100 (Krizhevsky et al., 2009), TinyImageNet (Chrabaszcz et al., 2017), and ImageNet (Deng et al., 2009). We use a basic ResNet architecture (He et al., 2016) for all out models.

We optimize the cross-entropy loss function. As regularizations, we use dropout (Srivastava et al., 2014) with $p=0.1$ (Srivastava et al., 2014) and the mixup data augmentation (Zhang et al., 2017) with $\alpha=1$ . The models are trained with Stochastic Gradient Descent (SGD) with learning rate $0.1$ , momentum $0.9$ , and weight decay $0.0001$ . The learning rate is scheduled to decrease by a factor of 10 at $50\%$ and $80\%$ of the training time. TinyImagenet models are trained for 1000 epochs, while other models are trained for 2000 epochs. To reduce variance, we train on all settings three times and report the mean and standard deviation. Results are shown in Table 1.

Table 1: Classification: Test accuracy mean (standard deviation) from 3 seeds

Model	Baseline	only-noise	CNT
TinyImageNet
ResNet18 - Mish	63.77 (0.20)	63.94 (1.02)	65.32 (0.59)
ResNet18 - ReLU	65.31 (0.05)	64.38 (0.00) ¹¹1Some models collapsed	64.87 (0.80)
CIFAR-100
ResNet18 - Mish	79.21 (0.36)	79.29 (0.84)	80.23 (0.67)
ResNet18 - ReLU	79.93 (0.30)	79.72 (0.58)	80.33 (0.90)
CIFAR-10
ResNet18 - Mish	96.74 (0.06)	96.81 (0.15)	96.70 (0.07)
ResNet18 - ReLU	96.63 (0.15)	96.79 (0.07)	96.70 (0.11)

Table 2: Low-capacity classification: Test accuracy mean (standard deviation) from 3 seeds

Model	Baseline	only-noise	CNT
CIFAR-100 ( $ch=64$ )
ResNet9	72.11 (0.80)	71.50 (4.33)	74.23 (0.17)
CIFAR-100 ( $ch=8$ )
ResNet9	44.45 (0.22)	53.71 (0.89)	54.19 (0.86)
ResNet18	57.94 (0.36)	63.62 (0.81)	63.76 (0.51)
TinyImagenet ( $ch=64$ )
ResNet9	45.45 (0.10)	53.50 (0.21)	52.65 (0.41)
TinyImagenet ( $ch=8$ )
ResNet9	19.05 (0.79)	25.34 (0.53)	33.45 (0.31)
ResNet18	34.46 (1.32)	39.51 (0.60)	41.64 (2.51)

From Table 1, we see that CIFAR-10 shows better results for only-noise, CIFAR-100 for CNT, and TinyImageNet for either CNT or baseline. Thus, only-noise and CNT almost always perform better than baseline, albeit by a small margin. The fact that only-noise is sometimes better (in CIFAR-10) suggests that adding noise may have its own benefits.

In the next set of experiments, we show that the benefits of CNT become larger and more consistent when the classification models have low capacity. To test the effect of CNT in a low-capacity model, we train the previous models with only 8 channels or/and with half the number of layers. These experiments use mixup $\alpha=0.5$ . Results are shown in Table 2. From 2, we see that CNT almost always leads to higher accuracy than the other methods except for ResNet9 with TinyImageNet, which obtains slightly better accuracy with only-noise. This provides evidence that conditioning on the noisy target can be very beneficial, especially when the model struggles due to low capacity.

4.2 Detecting Equilateral Shapes

Table 3: Equilateral Shape Detection: Test accuracy mean (standard deviation) from 3 seeds

Model	Baseline	only-noise	CNT
Shapes
CaffeNet - Mish	83.57 (2.13)	85.03 (0.80)	85.77 (0.21)
CaffeNet - ReLU	85.23 (0.40)	86.03 (0.31)	86.37 (0.65)
ResNet18 - Mish	96.63 (0.15)	96.77 (0.06)	96.80 (0.10)
ResNet18 - ReLU	96.57 (0.23)	96.83 (0.21)	96.87 (0.06)

We test CNT on the task of detecting geometrical shapes. We build on the equilateral triangle detection task from (Ahmad and Omohundro, 2009). In addition to triangles, we also add squares and rectangles to our task. Each image in the dataset is of size $64\times 64$ . Each vertex of a geometrical shape is represented by a closely formed cluster of points centered at a randomly chosen coordinate in the image. For triangles, we add three such clusters to the image and for quadrangles, we add four such clusters to the image. We ensure that the positioning of these clusters satify the geometrical properties of the shape they represent. For example, we ensure that the clusters are equidistant for the equilateral triangle and the square. Also, we ensure that the sides for the quadrangles are perpendicular to each other.

This task is framed as a classification task where we have two classification heads - One denotes whether the given shape has 4 sides or 3 sides and the second denotes whether the given shape has equal length sides or not. We use a basic ResNet architecture (He et al., 2016) and CaffeNet (Jia et al., 2014). We use dropout (Srivastava et al., 2014) with $p=0.1$ (Srivastava et al., 2014) in the ResNet architecture. We incorporate noise by feeding the noisy label through the normalization layers in each architecture similar to Figure 3. We minimize a binary cross entropy loss function for both classification heads. The models are trained for 200 epochs with Stochastic Gradient Descent (SGD) with learning rate $0.1$ , momentum $0.9$ , and weight decay $0.0001$ . The learning rate is scheduled to decrease by a factor of 10 at $50\%$ and $80\%$ of the training time. To reduce the variance, we train each model three times and report the mean and standard deviation. For the result, we report the results for the second classification head i.e. whether the given shape has equal length sides or not.

We report the results in Table 3, we see that CNT consistently leads to higher accuracy in all settings. The effect of CNT is more significant with CaffeNet, probably because this architecture has a lower capacity than ResNet.

4.3 Relational Reasoning (Sort-of-CLEVR)

Generalization on relational reasoning tasks has often been difficult for deep supervised networks, which has been partially successfully addressed through the use of curriculum learning to encourage learning simpler skills which can be re-used and re-purposed to solve more complex tasks which require multiple skills. For example, if a model needs to solve a relational question such as: “What color is the shape to the left of the triangle?”, the model needs to be able to recognize colors, detect shapes, and understand relative positioning of shapes. If a model masters all of these simpler individual skills it will generalize better when solving novel composite tasks which use many different skills.

We evaluated CNT on the Sort-of-CLEVR relational reasoning benchmark (Santoro et al., 2017). Each image in the sort-of-clevr benchmark is of size $75\times 75$ . The input consists of a question and an image. The task contains 3 types of questions - Unary, Binary, and Ternary. Unary questions consider properties of single objects. For example, what is the color of the square?. Binary questions consider relations between two objects. For example, what is the shape of the object closest to the red object?. Similarly, ternary questions consider relations between 3 obejcts. The Sort-of-CLEVR images consist of 6 randomly placed geometrical shapes of 6 possible colors and 2 possible shapes.

To demonstrate the generality of CNT, we evaluate both convolutional resnets (Preactresnet18 with 64 initial channels) and transformers on the Sort-of-CLEVR benchmark. Unlike with equilateral shape classification and image classification, we used relu everywhere in the convolutional network for the Sort-of-CLEVR tasks. We also injected a question-embedding along with the label and noise-level embeddings as a way of conditioning on the question. We report results on Sort-of-CLEVR (Table 4). Additionally, we found that accuracy converged fastest at the lowest noise levels (Figure 4) and more gradually at higher noise levels.

Table 4: Sort-Of-Clever (Ternary: question relating to three object, Binary: question relating to two object, Unary: question relating to one object): Test accuracy mean (standard deviation) from 5 seeds

Model	Ternary	Binary	Unary
ResNet - ReLU & Batch-norm
Baseline	74.00 (1.49)	78.34 (8.38)	99.93 (0.09)
only-noise	76.86 (1.52)	85.17 (6.56)	100.0 (0.00)
CNT	74.80 (2.38)	90.73 (1.30)	100.0 (0.00)
ResNet - Mish & Batch-norm
Baseline	75.60 (3.52)	91.93 (2.39)	93.17 (6.63)
only-noise	76.27 (2.98)	89.08 (5.67)	97.90 (2.37)
CNT	73.80 (3.03)	86.49 (8.35)	99.63 (0.66)
ResNet - ReLU & Group-norm
Baseline	83.72 (0.70)	97.43 (0.11)	100.0 (0.00)
only-noise	83.64 (0.59)	97.77 (0.16)	99.98 (0.05)
CNT	82.44 (0.86)	97.91 (0.23)	100.0 (0.00)
ResNet - Mish & Group-norm
Baseline	83.15 (0.81)	97.84 (0.20)	100.0 (0.00)
only-noise	83.72 (0.59)	97.89 (0.29)	100.0 (0.00)
CNT	82.81 (0.80)	97.49 (0.68)	100.0 (0.00)
Transformers
Baseline	54.40 (0.80)	72.20 (2.48)	78.40 (11.60)
only-noise	55.6 (2.87)	77.0 (2.45)	98.2 (0.40)
CNT	52.40 (0.49)	73.00 (5.02)	83.60 (12.60)

4.4 Reinforcement Learning with Decision Transformers

We test CNT on the atari benchmark (Bellemare et al., 2012) following the same setup as decision transformer (Chen et al., 2021). The key in decision transformer is the choice if trajectory representation. The trajectory is represented such that it allows conditional generation of actions based on future expected rewards. This is done by conditioning the model on return-to-go $\hat{R}_{k}=\sum_{k^{\prime}=k}^{K}r_{k}$ , where $k$ denotes the timesteps. This results in the following trajectory representation: $\tau=\big{(}\hat{R}_{1},s_{1},a_{1},\hat{R}_{2},s_{2},a_{2},\hat{R}_{3},s_{3},a_{3},\ldots\big{)}$ . At test time, we can condition the desired return $\hat{R}_{1}$ and the start state $s_{1}$ to generate actions.

Similar to decision transformer, we use a fixed context of length of $K$ to train our models i.e. we only feed in the last $K$ timesteps. This results in a sequence length of $3K$ (considering 3 modalities - returns, states, and actions). Each modality is processed into an embedding - The states are processed using a convolutional encoder into an embedding, the returns and actions are processed using linear layers. The processed tokens are fed into a GPT (Radford and Narasimhan, 2018) model. The outputs corresponding to $s_{k}$ are fed into a linear layer to predict the action $a_{k}$ to be taken at $k$ . To incorporate CNT, we replace the layer normalization layers in GPT with conditional normalization layers. During training, we feed the noisy actions $a_{k}(t)$ and $t_{k}$ through the conditional normalization layers and the model is tasked with predicting $a_{k}(0)$ (the true action).

We train on 500000 examples from DQN replay dataset similar to (Agarwal et al., 2019). We run on 4 atari games - Qbert, Seaquest, Pong, and Breakout. We use a context length ( $K$ ) of 30. We report the mean and variance across 10 seeds. We report the results in Table 5.

Table 5: Here we report the returns obtained by the models on the four Atari Games studied in the Decision Transformer paper (Chen et al., 2021). Mean and standard deviation are reported across 10 seeds.

Model	Baseline	CNT
Qbert	3084.22 (1560.93)	3390.22 (2360.71)
Seaquest	894.67 (368.57)	1106.75 (148.74)
Breakout	74.67 (14.55)	63.22 (9.60)
Pong	11.22 (6.16)	9.22 (6.78)

4.5 Injecting Laplace Noise

To demonstrate the generality of CNT, we also trained with noisy targets where the noise follows a Laplace distribution instead of a Gaussian distribution (Figure 5). Aside from this change, we kept all hyperparameters the same and used the Mish activation. These results are shown in Table 6.

Table 6: We report classification accuracy to compare CNT with gaussian noise and CNT with Laplace noise. All methods use the Mish activation and PreActResnet18 architecture).

Model	Baseline	Gaussian	Laplace
CIFAR-10	96.74	96.70	96.94
CIFAR-100	79.21	80.23	78.54
Tiny-Imagenet	63.77	65.32	65.68

5 Limitations

We tested our approach on a wide variety of tasks to verify its general validity. However, we only tested one way of conditioning on the noisy target, which is not necessarily the best.

Progressive learning has known theoretical benefits on generalization. However, there is less theoretical evidence (as far as we can tell) for the benefits of multi-scale methods on generalization. It is possible that using a specific annealing process to decrease the noise during training in a progressive learning fashion would also provide good improvements in generalization. However, we did not test this idea as we prefer the multi-scale approach’s simplicity, which does not require any hyper-parameter tuning. Our goal is to have a simple regularization method that does not require tuning.

We injected noise directly into the target space (for example, one-hots for classification problems), which does not reflect the semantic structure of the problem being solved. We used this approach because it is fully general and does not assume any special knowledge. However, it would also be interesting to inject noise into a semantically meaningful space instead of the original target space.

We used Gaussian noise due to its simplicity and the fact that the VP process has obtained great success in score-based generative models. We also used Laplace noise. However, other types of noise could be tried.

6 Conclusion

We devised a new regularizer called CNT, which leverages top-down feedback and multi-scale learning by conditioning the neural network on a noisy target. This noisy target acts as a hint that allows the model to succeed faster on easier tasks with less noise than harder ones. CNT thus provides a continuous path for the model to progress from easy to hard examples. We show that we obtain higher generalization when using CNT on a wide variety of different tasks. The improvement from using CNT is particularly significant in reinforcement learning and when the model has low capacity. This suggests that conditioning on a noisy target is especially beneficial for difficult problems.

Notably, only conditioning on pure noise also offered a slight benefit; this suggests that conditioning on noise may be beneficial on its own. We hypothesize that the network still attempts to solve the problem as multiple sub-problems (conditional on the value of $t$ ) even though the problem is the same at every value of $t$ ; this may regularize the network and reduce overfitting. However, given the more substantial theoretical and empirical evidence for CNT, it is more sensible to condition on the noisy target than simply on noise when conditioning on the noisy target is no more difficult.

References

Fayek et al. [2018] Haytham M Fayek, Lawrence Cavedon, and Hong Ren Wu. On the transferability of representations in neural networks between datasets and tasks. arXiv preprint arXiv:1811.12273, 2018.
Bengio et al. [2009] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48, 2009.
Terekhov et al. [2015] Alexander V Terekhov, Guglielmo Montone, and J Kevin O’Regan. Knowledge transfer in deep block-modular neural networks. In Conference on Biomimetic and Biohybrid Systems, pages 268–279. Springer, 2015.
Rusu et al. [2016] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
Karras et al. [2017] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
Fayek [2018] Haytham Fayek. Continual deep learning via progressive learning. PhD thesis, RMIT University, 2018.
Rauss and Pourtois [2013] Karsten Rauss and Gilles Pourtois. What is bottom-up and what is top-down in predictive coding? Frontiers in Psychology, 4:276, 2013.
Kinchla and Wolfe [1979] R. A. Kinchla and J. M. Wolfe. The order of visual processing: “top-down,” “bottom-up,” or “middle-out”. Perception & Psychophysics, 25:225–231, 1979.
Karnewar and Iyengar [2019] Animesh Karnewar and Raghu Sesha Iyengar. Msg-gan: Multi-scale gradients gan for more stable and synchronized multi-scale image synthesis. arXiv preprint arXiv:1903.06048, 2019.
Karras et al. [2020] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arxiv:2006.11239, 2020.
Song et al. [2020] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations, 2020.
Huang and Belongie [2017] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1501–1510, 2017.
De Vries et al. [2017] Harm De Vries, Florian Strub, Jérémie Mary, Hugo Larochelle, Olivier Pietquin, and Aaron Courville. Modulating early visual processing by language. arXiv preprint arXiv:1707.00683, 2017.
Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
Dhariwal and Nichol [2021] Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. arXiv preprint arXiv:2105.05233, 2021.
Tancik et al. [2020] Matthew Tancik, Pratul P Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. arXiv preprint arXiv:2006.10739, 2020.
Theeuwes [2010] Jan Theeuwes. Top–down and bottom–up control of visual selection. Acta psychologica, 135(2):77–99, 2010.
Perez et al. [2018] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
Anderson et al. [2018] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6077–6086, 2018.
Carreira et al. [2016] Joao Carreira, Pulkit Agrawal, Katerina Fragkiadaki, and Jitendra Malik. Human pose estimation with iterative error feedback. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4733–4742, 2016.
Mittal et al. [2020] Sarthak Mittal, Alex Lamb, Anirudh Goyal, Vikram Voleti, Murray Shanahan, Guillaume Lajoie, Michael Mozer, and Yoshua Bengio. Learning to combine top-down and bottom-up signals in recurrent neural networks with attention over modules. In International Conference on Machine Learning, pages 6972–6986. PMLR, 2020.
Lamb et al. [2021] Alex Lamb, Anirudh Goyal, Agnieszka Słowik, Michael Mozer, Philippe Beaudoin, and Yoshua Bengio. Neural function modules with sparse arguments: A dynamic approach to integrating information across layers. In International Conference on Artificial Intelligence and Statistics, pages 919–927. PMLR, 2021.
Schaul et al. [2015] Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International conference on machine learning, pages 1312–1320. PMLR, 2015.
Soviany et al. [2021] Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. Curriculum learning: A survey. arXiv preprint arXiv:2101.10382, 2021.
Hacohen and Weinshall [2019] Guy Hacohen and Daphna Weinshall. On the power of curriculum learning in training deep networks. In International Conference on Machine Learning, pages 2535–2544. PMLR, 2019.
Srivastava et al. [2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
Fortunato et al. [2017] Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, et al. Noisy networks for exploration. arXiv preprint arXiv:1706.10295, 2017.
Alemi et al. [2016] Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. CoRR, abs/1612.00410, 2016. URL http://arxiv.org/abs/1612.00410.
Bengio et al. [2013] Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-encoders as generative models. In Advances in neural information processing systems, pages 899–907, 2013.
Ho et al. [2021] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. arXiv preprint arXiv:2106.15282, 2021.
Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
Chrabaszcz et al. [2017] Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819, 2017.
Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
Zhang et al. [2017] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
Ahmad and Omohundro [2009] Subutai Ahmad and Stephen M. Omohundro. Equilateral triangles: A challenge for connectionist vision. 2009.
Jia et al. [2014] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678, 2014.
Santoro et al. [2017] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4967–4976. Curran Associates, Inc., 2017.
Bellemare et al. [2012] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. CoRR, abs/1207.4708, 2012. URL http://arxiv.org/abs/1207.4708.
Chen et al. [2021] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. CoRR, abs/2106.01345, 2021. URL https://arxiv.org/abs/2106.01345.
Radford and Narasimhan [2018] Alec Radford and Karthik Narasimhan. Improving language understanding by generative pre-training. 2018.
Agarwal et al. [2019] Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. Striving for simplicity in off-policy deep reinforcement learning. CoRR, abs/1907.04543, 2019. URL http://arxiv.org/abs/1907.04543.