
Augmentation Pathways Network
for Visual Recognition

Yalong Bai, Mohan Zhou, Wei Zhang, Bowen Zhou, and Tao Mei

Yalong Bai, Wei Zhang, Bowen Zhou, and Tao Mei are with JD Explore Academy, Beijing, China, 100010. E-mail: [email protected]. Mohan Zhou is with Harbin Institute of Technology.
Abstract

Data augmentation is practically helpful for visual recognition, especially at the time of data scarcity. However, such success is limited to only a few light augmentations (e.g., random crop, flip). Heavy augmentations are either unstable or show adverse effects during training, owing to the big gap between the original and augmented images. This paper introduces a novel network design, termed Augmentation Pathways (AP), to systematically stabilize training on a much wider range of augmentation policies. Notably, AP tames various heavy data augmentations and stably boosts performance without careful selection among augmentation policies. Unlike the traditional single-pathway design, augmented images are processed in different neural paths. The main pathway handles light augmentations, while other pathways focus on heavier augmentations. By interacting with multiple paths in a dependent manner, the backbone network robustly learns from shared visual patterns among augmentations while suppressing the side effects of heavy augmentations. Furthermore, we extend AP to high-order versions for handling multiple hyperparameter settings of a given augmentation, demonstrating its robustness and flexibility in practical usage. Experimental results on ImageNet demonstrate compatibility and effectiveness on a much wider range of augmentations, while consuming fewer parameters and lower computational costs at inference time.

Index Terms:
Visual Recognition, Data Augmentation, Neural Network Design, Augmentation Pathways Network.

1 Introduction

Deep convolutional neural networks (CNN) have achieved remarkable progress on visual recognition. In some cases, deep models are likely to overfit the training data as well as its noisy signals [1], even on a large-scale dataset such as ImageNet [2, 3]. Data augmentation usually serves as a standard technique for regularizing the training process and reducing the generalization error, especially when data annotations are scarce.

However, such successes in data augmentation are restricted to a handful of augmentations that only slightly jitter the original image. A large collection of augmentation operations cannot be easily applied to arbitrary configurations (e.g., datasets, backbones, hyper-parameters): sometimes data augmentation shows only marginal or even adverse effects on image classification. Following the definition in prior works (e.g., SimCLR [4], the imgaug toolkit [5], DSSL [6]), we roughly group augmentation operations into two categories (Fig. 1 left). 1) Light Augmentation, which only slightly modifies an image without significant information loss. Typical operations include random Flip and Crop [2, 7, 8, 9]. Note that the original image can also be treated as a special case of light augmentation (i.e., Identity). 2) Heavy Augmentation (also named Strong Augmentation [10]), which largely alters the image appearance, sometimes stripping out a significant amount of information (such as color or object structure). Typical operations include Gray (transforming a color image to grayscale), GridShuffle [11] (destroying object structure by shuffling image grids), and CutOut [12] (masking out a random area of the image), etc.

Figure 1: Left: Examples of original images and their lightly augmented (random Resize, Crop, Flip) and heavily augmented (Gray, GridShuffle, RandAugment) versions. Middle: Improvement in Top-1 accuracy from applying two heavy augmentations (Gray and GridShuffle) on ImageNet and its subsets (ImageNet$_n$, where $n$ indicates the number of images used per category). The standard network (ResNet-50) is quite unstable, showing marginal or adverse effects. Right: Improvement in Top-1 accuracy from applying a searched augmentation (RandAugment [13]: a collection of randomly selected heavy augmentations) on ImageNet. The augmentation policy searched for ResNet-50 leads to a performance drop on iResNet-50. In contrast, the Augmentation Pathways (AP) based network steadily benefits from a much wider range of augmentation policies for robust classification.

Based on prior studies [2, 7, 14], light augmentations have demonstrated stable performance improvements, since lightly augmented images usually share very similar visual patterns with the original ones. However, heavy augmentations inevitably introduce noisy feature patterns that follow different distributions from the original samples. Training directly with these images is thus often unstable, and sometimes shows adverse effects on performance. For example, in Fig. 1 (Middle), GridShuffle is highly unstable on ImageNet when trained with a standard network (see the ResNet column). This may be due to the implicit gap among the three sets of data: training, augmented, and test.

Intuitively, heavy augmentations also introduce helpful and complementary information during training [11]. Recent studies [15, 16] also suggest that networks trained with heavier augmentation yield representations that are closer to those observed in the human brain. However, heavy augmentation tends to generate images with larger variations from the original feature space. Such variations are not always helpful, since irrelevant feature bias is also introduced alongside the augmentation. From the opposite perspective, useful information is still implied in the visual patterns shared between the original and heavily augmented images. For example, contour information is augmented but color bias is introduced by Gray; visual details are augmented while object structure is destroyed by GridShuffle [11]. Therefore, expertise and knowledge are required to select feasible data augmentation policies [11], which is quite cumbersome in most cases. Even when augmentation improvements have been found for one specific domain or dataset, they often do not transfer well to other datasets. Some previous works employ search algorithms or adversarial learning to automatically find suitable augmentation policies [17, 18, 19, 13]. However, such methods require additional computation to obtain suitable policies, and policies searched for one setting are usually difficult to transfer to other settings. For example, in Fig. 1 (Right), RandAugment [13] searched for ResNet leads to a slight performance drop on iResNet [20] (an information-flow variant of ResNet).

In this work, we design a network architecture that handles a wide range of data augmentation policies, rather than adapting augmentation policies to specific datasets or architectures. A plug-and-play “Augmentation Pathways” (AP) design is proposed to restructure the neural paths by discriminating between augmentation policies. Specifically, a novel augmentation pathway based convolutional layer (AP-Conv) is designed to replace the standard Conv layer and stabilize training with a wide range of augmentations. As an alternative to the standard convolutional layer, AP-Conv adapts the network design to a much wider range of heavy data augmentations. As illustrated in Fig. 2, a traditional convolutional neural network directly feeds all images into the same model. In contrast, our AP-Conv (right of Fig. 2) processes lightly and heavily augmented images through different neural pathways. Precisely, a basic AP-Conv layer consists of two convolutional pathways: 1) the main pathway focuses on light augmentations, and 2) the augmentation pathway is shared between lightly and heavily augmented images to learn representations common to both for recognition. The two pathways interact with each other through shared feature channels. To further regularize the feature space, we also propose an orthogonal constraint to decouple the features learned by different pathways. Notably, our AP-Conv highlights the beneficial information shared between pathways and suppresses negative variations from heavy data augmentation. In this way, the Augmentation Pathways network naturally adapts to different data augmentation policies, including manually designed and auto-searched augmentations.

Figure 2: Illustration of a standard CNN (Left) and our proposed Augmentation Pathways network (Right) for handling data augmentations. Details of the basic AP-Conv in the purple dashed box are illustrated in Fig. 3.

Furthermore, different augmentation hyperparameters may lead to different visual appearances and classification accuracies, and tuning such hyperparameters is non-trivial. Some works propose to automatically search for a proper hyperparameter. However, these methods usually require additional computation or search cost [17], and the learned augmentation policies are dataset or network dependent [18, 21], so their generalization capability is limited. To address this, we gather all useful information from one augmentation policy under various hyperparameters, instead of selecting the single most appropriate hyperparameter as previous works did. Specifically, we extend the augmentation pathways to higher orders, so that training data generated with different hyperparameter selections of a data augmentation policy pass through different pathways. In this way, the information dependencies among different hyperparameters of data augmentation policies can be well structured, and the information from different neural pathways can be gathered into a well-structured and rich feature space.

Compared to the standard convolutional layer, our AP-Conv contains fewer connections and parameters. Moreover, it is highly compatible with standard networks: an AP-Conv based network can even be directly finetuned from a standard CNN. Experimental results on the ImageNet dataset demonstrate AP-Conv’s efficiency and effectiveness when equipped with manually designed heavy augmentations as well as searched data augmentation collections.

2 Related Work

Manually designed augmentation Since data augmentation increases training data diversity without collecting additional samples, it is a standard operation in training deep vision models and plays an essential role in improving generalization and performance. In general, light data augmentation policies, including random cropping and horizontal flipping, are commonly used in various tasks [14, 22, 23, 24]. Such augmentations keep the augmented images close to the original training distribution and lead to steady performance improvements for different neural network architectures trained on various datasets. Recently, heavy data augmentation methods have received more attention from the computer vision research community. Some methods [12, 25, 26] randomly erase image patches from the original image or replace patches with random noise. GridShuffle [11] destroys the global structure of objects in images and forces the model to learn local detail features. However, such manually designed heavy data augmentations are dataset-specific and usually do not adapt well to different datasets.

Searched augmentation Inspired by the success of neural architecture search algorithms on various computer vision tasks [27, 28], several recent studies propose search algorithms to automatically obtain augmentation policies for given datasets and network architectures. These studies try to find the best augmentation policy collection from predefined transformation functions via RL based strategies [17], population based training [21], Bayesian optimization [18], or grid search based algorithms [13]. Such methods usually take many GPU hours to search for a proper data augmentation collection before model training. Moreover, the resulting augmentation strategies are, in principle, dataset- and network-architecture-specific. These two limitations hurt the practical value of search-based data augmentation methods.

In this paper, we introduce a new viewpoint on the inter-dependency among datasets, network architectures, and data augmentation policies. Rather than selecting proper data augmentation policies for each dataset or network architecture, we propose a network architecture design that deals with various data augmentations, including not only manually designed but also searched augmentations. With lower computational cost, our method achieves stable performance improvements on various network architectures and datasets equipped with different kinds of data augmentation methods.

3 Methodology

In this section, we start with a general description of the basic augmentation pathways (AP) network (Sec. 3.1), and then introduce two extensions of AP (Sec. 3.2) for handling multiple hyperparameters of a given augmentation policy.

We focus on the deep convolutional neural network (CNN) based fully supervised image classification problem. A typical CNN architecture consists of $T$ stacked convolutional layers $\{c_1, c_2, \ldots, c_T\}$ and a classifier $f$. Given a training image $I_i$ with its category label $l_i$, $\phi_i$ denotes the lightly augmented version of $I_i$. Note that the original input image $I$ can be regarded as a special case of $\phi$. The overall objective of a typical image classification network is to minimize:

\mathcal{L}_{cls}=\sum_{i=1}^{N}\mathcal{L}\big(f(c_T(\phi_i)),\,l_i\big), \qquad (1)

where $c_t(\phi_i)=W_t\,c_{t-1}(\phi_i)+b_t$, $\mathcal{L}$ is the cross-entropy loss, and $W_t\in\mathbb{R}^{n_{t-1}\times h_t\times w_t\times n_t}$, $b_t\in\mathbb{R}^{n_t\times 1}$ are the learnable parameters of $c_t$ with kernel size $h_t\times w_t$; $n_{t-1}$ and $n_t$ are the numbers of input and output channels of $c_t$, respectively.

3.1 Augmentation Pathways (AP)

We first introduce the convolutional operation with augmentation pathways (AP-Conv), the basic unit of our proposed AP network architecture. Different from the standard convolution $c_t$ ($t=1,\ldots,T$ denoting the layer index), the AP version convolution $\mathbb{n}_t$ consists of two convolutions $c^1_t$ and $c^2_t$. $c^1_t$ forms the main pathway, learning feature representations of the lightly augmented input $\phi$ (whose distribution is similar to that of the original images). $c^2_t$ is the pathway that learns visual patterns shared between the lightly augmented image $\phi$ and the heavily augmented image $\varphi$. $\varphi$ varies with the data augmentation policy and deviates from the original image distribution. The operations of a basic AP-Conv $\mathbb{n}_t$ can be defined as:

\mathbb{n}_t(\phi_i)=c^1_t(\phi_i)\mathbin{+\!\!+}c^2_t(\phi_i)=\big(W^1_t\,\mathbb{n}_{t-1}(\phi_i)+b^1_t\big)\mathbin{+\!\!+}\big(W^2_t\,c^2_{t-1}(\phi_i)+b^2_t\big),
\mathbb{n}_t(\varphi_i)=c^2_t(\varphi_i)=W^2_t\,c^2_{t-1}(\varphi_i)+b^2_t, \qquad (2)

where $\mathbin{+\!\!+}$ denotes vector concatenation, and $W^1_t\in\mathbb{R}^{n_{t-1}\times h_t\times w_t\times(n_t-m_t)}$, $b^1_t\in\mathbb{R}^{(n_t-m_t)\times 1}$ and $W^2_t\in\mathbb{R}^{m_{t-1}\times h_t\times w_t\times m_t}$, $b^2_t\in\mathbb{R}^{m_t\times 1}$ are the convolutional weights and biases of $c^1_t$ and $c^2_t$, respectively. $m_{t-1}$ and $m_t$ denote the numbers of input and output channels of $\mathbb{n}_t$ used for processing heavily and lightly augmented inputs jointly, and are smaller than $n_{t-1}$ and $n_t$. For lightly augmented inputs, the output size of $\mathbb{n}_t$ is the same as that of $c_t$. As shown in Fig. 3, AP-Conv contains two different neural pathways within one layer, for $\phi$ and $\varphi$ respectively.
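To make the routing in Eq. (2) concrete, the sketch below shows one possible PyTorch implementation of a basic AP-Conv. The class name, the channel bookkeeping, and the convention that the shared channels sit at the end of the concatenated output are illustrative assumptions, not the released implementation.

```python
# A minimal sketch of a basic (2nd-order) AP-Conv following Eq. (2).
# Assumptions: the shared-pathway channels are the last `shared_in` channels
# of the previous layer's output; names are illustrative only.
import torch
import torch.nn as nn

class APConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size,
                 shared_in, shared_out, stride=1, padding=0):
        super().__init__()
        # c^1_t: main pathway, sees the full features of the light view only.
        self.conv1 = nn.Conv2d(in_channels, out_channels - shared_out,
                               kernel_size, stride=stride, padding=padding)
        # c^2_t: augmentation pathway, shared by the light and heavy views.
        self.conv2 = nn.Conv2d(shared_in, shared_out,
                               kernel_size, stride=stride, padding=padding)
        self.shared_in = shared_in

    def forward(self, x, heavy=False):
        if heavy:
            # Heavy view (varphi): only the shared pathway, fed m_{t-1} channels.
            return self.conv2(x)
        # Light view (phi): main pathway on all channels, shared pathway on the
        # last m_{t-1} channels, then concatenation (the ++ operator in Eq. (2)).
        out_main = self.conv1(x)
        out_shared = self.conv2(x[:, -self.shared_in:])
        return torch.cat([out_main, out_shared], dim=1)

# toy usage: 256 -> 512 channels, half of them shared (m_t = n_t / 2)
layer = APConv2d(256, 512, 3, shared_in=128, shared_out=256, padding=1)
light = torch.randn(2, 256, 14, 14)   # full feature map of the light view
heavy = torch.randn(2, 128, 14, 14)   # shared-channel features of the heavy view
print(layer(light).shape, layer(heavy, heavy=True).shape)
```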

Figure 3: The detailed structure of the basic augmentation pathway based convolutional layer.
TABLE I: Examples of data augmentations with their hyperparameters. Gray, Blur, GridShuffle, and MPN are manually designed heavy augmentations. RandAugment is a searched augmentation combination including 14 different image transformations (e.g., Shear, Equalize, Solarize, Posterize, Rotate), most of which are heavy transformations.
Augmentation | Hyperparameter | Description
Gray | the alpha value $\alpha\in[0,1]$ of the grayscale image when overlaid over the original image | $\alpha$ close to 1.0 means the new grayscale image is mostly visible
Blur | the kernel size $k$ of the blur | larger $k$ leads to a more blurred image
GridShuffle | the number of grids $g\times g$ in the image | larger $g$ results in smaller grids and more drastic destruction of the image
MPN | the scaling factor $s$ of pixel values (Multiplicative Noise) | larger $s$ results in a brighter image
RandAugment [13] | the number $n$ of augmentation transformations applied sequentially, and the magnitude $m$ of all transformations | larger $n$ and $m$ result in a more heavily augmented image

Comparison to Standard Convolution   A standard convolution can be transformed into a basic AP-Conv by splitting off an augmentation pathway and disabling a fraction of the connections. In general, the number of parameters in $\mathbb{n}_t$ is $\delta_t$ less than that of a standard convolution under the same settings, where

\delta_t=(n_{t-1}-m_{t-1})\times m_t\times h_t\times w_t. \qquad (3)

For example, if we set $m_t=\frac{1}{2}n_t$ and $m_{t-1}=\frac{1}{2}n_{t-1}$, AP-Conv contains only 75% of the parameters of the standard convolution.
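As a concrete sanity check of Eq. (3), consider an assumed $3\times3$ convolution with $n_{t-1}=n_t=512$ and the default half split $m_{t-1}=m_t=256$ (layer sizes chosen here purely for illustration):

\begin{aligned}
\text{standard Conv: }& n_{t-1}\,h_t\,w_t\,n_t=512\cdot 3\cdot 3\cdot 512=2{,}359{,}296,\\
\text{AP-Conv: }& \underbrace{512\cdot 3\cdot 3\cdot 256}_{W^1_t}+\underbrace{256\cdot 3\cdot 3\cdot 256}_{W^2_t}=1{,}769{,}472=75\%,\\
\delta_t={}&(512-256)\times 256\times 3\times 3=589{,}824=2{,}359{,}296-1{,}769{,}472.
\end{aligned}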

The only additional operation in AP-Conv is a conditional statement that routes the features of $\phi$ to both $c^1_t$ and $c^2_t$, and the features of $\varphi$ to $c^2_t$ only.

Augmentation Pathways based Network   The key idea of the basic augmentation pathways based network is to mine the visual patterns shared between the two pathways, which handle inputs following different distributions. A basic constraint is that the shared features should boost object classification, which is also the common objective of the two neural pathways:

\mathcal{L}_{cls}=\sum_{i=1}^{N}\mathcal{L}\big(f_{\phi}(\mathbb{n}_T(\phi_i)),l_i\big)+\mathcal{L}\big(f_{\varphi}(\mathbb{n}_T(\varphi_i)),l_i\big)+\lambda S_i,
S_i=\sum_{t=1}^{T}\big\langle c^1_t(\phi_i),\,c^2_t(\phi_i)\big\rangle, \qquad (4)

where $f_{\phi}$ and $f_{\varphi}$ are the classifiers for light and heavy augmentations respectively, and $S$ is a Cross Pathways Regularization term that measures the similarity of visual patterns between the two neural pathways. The formulation of $S$ is similar to standard weight decay; both are L2 regularizations. Denoting the loss weight of standard weight decay as $\omega$, we simply set $\lambda=0.1\omega$ for all experiments in this paper. Minimizing $S_i$ penalizes filter redundancy between $c^1_t$ and $c^2_t$. As a result, $c^1_t$ focuses on learning $\phi$-specific features, while, owing to the classification losses in Eq. (4), $c^2_t$ is expected to highlight patterns shared between $\phi$ and $\varphi$. Finally, these common visual patterns assist $f_{\phi}$ in classifying $\phi$ correctly. During inference, we use the label with the maximum confidence score in $f_{\phi}(\mathbb{n}_T(I_i))$ as the prediction for the image $\phi=I_i$.
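The following training-step sketch illustrates one way to implement the objective in Eq. (4). The backbone interface (returning the final light-view feature together with the per-layer $(c^1_t, c^2_t)$ activations) and all names are assumptions for illustration; in particular, the elementwise reading of the inner product assumes the default equal pathway widths $m_t=\frac{1}{2}n_t$.

```python
# Minimal sketch of one training step for the objective in Eq. (4).
# `backbone`, `f_phi`, `f_varphi` and their interfaces are assumed for
# illustration; this is not the authors' released training code.
import torch.nn.functional as F

def ap_training_step(backbone, f_phi, f_varphi, light, heavy, labels, lam=1e-5):
    feat_phi, pathway_acts = backbone(light, heavy=False, return_pathways=True)
    feat_varphi = backbone(heavy, heavy=True)

    # two classification losses, one per pathway-specific head
    loss = F.cross_entropy(f_phi(feat_phi), labels) \
         + F.cross_entropy(f_varphi(feat_varphi), labels)

    # Cross Pathways Regularization S_i: penalize redundant responses between
    # the main pathway c^1_t and the shared pathway c^2_t on the light view
    # (assumes equal pathway widths, i.e. the default m_t = n_t / 2).
    s = 0.0
    for c1_out, c2_out in pathway_acts:
        s = s + (c1_out.flatten(1) * c2_out.flatten(1)).sum(dim=1).abs().mean()
    return loss + lam * s
```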

Notably, an AP based network can be constructed by simply replacing the standard convolutional layers in a typical CNN with our AP-Conv layers, as shown in Fig. 2. In practice, low-level features can be directly shared between $\phi$ and $\varphi$; in most cases, the performance of a typical CNN can be significantly improved by replacing only the last few standard Conv layers with AP-Conv.

3.2 Extensions for Augmentation Pathways

As shown in Table I, some augmentation policies have several choices of hyperparameters. Deep models are usually sensitive to these hyperparameters, since different augmentation hyperparameters for the same image may lead to a wide variety of appearances. Previous methods tend to pick one proper hyperparameter according to expert knowledge or automatic search results.

We find that common visual patterns exist across an augmentation policy applied with different hyperparameters, and that the feature spaces shared among them usually present dependencies. For example, the shared features learned from Blur$(k=5)$ can benefit the recognition of images with Blur$(k<5)$; for GridShuffle, some detailed visual patterns learned from small grids can be reused to represent images with large grids. We therefore extend the augmentation pathways to handle an augmentation policy under various hyperparameter settings. We rank the hyperparameters of an augmentation according to the similarity of the resulting image distributions to the original training images, and then feed the images augmented with different hyperparameters into different pathways in a high-order (nested) manner. In this way, our high-order AP can gather and structure information from augmentations with various hyperparameters.

Figure 4: The 3rd-order homogeneous augmentation pathways network is extended from the basic AP but handles heavy augmentations under two different hyperparameters ($g$ for GridShuffle), according to the visual feature dependencies among input images.
Figure 5: The network architecture of our high-order heterogeneous augmentation pathways network. Four heterogeneous neural pathways (HeAP4) correspond to four different input images (lightly augmented images and GridShuffled images with $g=2,4,7$). Note that only the main neural pathway, shown in red, is activated during inference.

Extension-1: High-order Homogeneous Augmentation Pathways We extend the basic augmentation pathway into higher orders to mine shared visual patterns at different levels. Take GridShuffle as an example: we choose two different hyperparameters to generate the augmented images $\varphi=$ GridShuffle$(g=2)$ and $\varphi'=$ GridShuffle$(g=7)$. Images augmented by GridShuffle are expected to drive the network to learn visual patterns inside grids, since the positions of all grids in the image have been shuffled [11]. Considering that the grids in $\varphi'$ are smaller than those in $\phi$ and $\varphi$, the local detail features learned from $\varphi'$ can be reused for $\varphi$ and $\phi$. We propose a convolution with 3rd-order homogeneous augmentation pathways (AP3-Conv), which consists of three homogeneous convolutions $c^1_t$, $c^2_t$, and $c^3_t$ for handling different inputs. Similar to the basic AP-Conv, $c^1_t$ is the main pathway targeting light-augmentation ($\phi$-specific) features, while the augmentation pathways $c^2_t$ and $c^3_t$ are designed for learning the visual patterns shared by $\{\phi,\varphi\}$ and $\{\phi,\varphi,\varphi'\}$, respectively. The operation of AP3-Conv can be formulated as:

\mathbb{n}_t(\phi_i)=c^1_t(\phi_i)\mathbin{+\!\!+}c^2_t(\phi_i)\mathbin{+\!\!+}c^3_t(\phi_i),
\mathbb{n}_t(\varphi_i)=c^2_t(\varphi_i)\mathbin{+\!\!+}c^3_t(\varphi_i), \qquad \mathbb{n}_t(\varphi'_i)=c^3_t(\varphi'_i). \qquad (5)

In general, the pathway convolution $c^j_t(x)$ can be defined as an operation that filters information from the $j$-th through the last neural pathways,

c^j_t(x)=W^j_t\big(c^j_{t-1}(x)\mathbin{+\!\!+}c^{j+1}_{t-1}(x)\mathbin{+\!\!+}\cdots\mathbin{+\!\!+}c^k_{t-1}(x)\big)+b^j_t, \qquad (6)

where $1\le j\le k$ and $k$ is the total number of neural pathways. For AP3-Conv, we set $k=3$: $c^1_t$ takes the outputs of $c^1_{t-1}$, $c^2_{t-1}$, and $c^3_{t-1}$ as inputs, while $c^2_t$ takes the outputs of $c^2_{t-1}$ and $c^3_{t-1}$ as inputs. In this way, the dependency across $\phi$, $\varphi$, and $\varphi'$ can be built. Fig. 4 illustrates a network with 3rd-order homogeneous augmentation pathways (AP3) handling two different hyperparameters of GridShuffle, whose objective function is defined as:

\mathcal{L}_{cls}=\sum_{i=1}^{N}\mathcal{L}\big(f_{\phi}(\mathbb{n}_T(\phi_i)),l_i\big)+\mathcal{L}\big(f_{\varphi}(\mathbb{n}_T(\varphi_i)),l_i\big)+\mathcal{L}\big(f_{\varphi'}(\mathbb{n}_T(\varphi'_i)),l_i\big)+\lambda S_i,
S_i=\sum_{t=1}^{T}\big\langle c^1_t(\phi_i),c^2_t(\phi_i),c^3_t(\phi_i)\big\rangle+\big\langle c^2_t(\varphi_i),c^3_t(\varphi_i)\big\rangle. \qquad (7)

The original image $\phi=I_i$ is predicted by $f_{\phi}(\mathbb{n}_T(I_i))$ during inference.

By analogy, we can design a higher-order augmentation pathways network with $k$ homogeneous dataflow pathways for handling $k-1$ different settings of a given heavy data augmentation policy. In general, our high-order APk-Conv can handle various settings of the given augmentation and collect useful visual patterns at different levels. Finally, all features are integrated in a dependent manner, resulting in a well-structured and rich feature space for original image classification.
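The nested routing of Eqs. (5) and (6) can be sketched for an arbitrary order $k$ as follows; the class name, the even channel split, and the channel ordering are our illustrative assumptions.

```python
# Sketch of k-th order homogeneous AP-Conv routing (Eqs. (5)-(6)): pathway j
# consumes the previous-layer outputs of pathways j..k, and a view of a given
# heaviness level only passes through pathways level..k. Even channel splits
# and the class name are assumptions for illustration.
import torch
import torch.nn as nn

class APkConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, k=3, **kwargs):
        super().__init__()
        self.k = k
        in_split, out_split = in_channels // k, out_channels // k
        # pathway j (0-indexed) sees the channels produced by pathways j..k-1
        self.paths = nn.ModuleList([
            nn.Conv2d(in_split * (k - j), out_split, kernel_size, **kwargs)
            for j in range(k)
        ])

    def forward(self, x, level=1):
        # level=1: lightest view (all pathways); level=k: heaviest view (last pathway only)
        in_split = x.size(1) // (self.k - level + 1)
        outs = []
        for j in range(level - 1, self.k):
            xj = x[:, (j - level + 1) * in_split:]   # channels of pathways j..k-1
            outs.append(self.paths[j](xj))
        return torch.cat(outs, dim=1)

# toy usage for k=3 (matching AP3-Conv): channel counts shrink with heaviness
layer = APkConv2d(384, 384, 3, k=3, padding=1)
phi     = torch.randn(2, 384, 14, 14)   # light view: channels of all pathways
varphi  = torch.randn(2, 256, 14, 14)   # heavier view: channels of pathways 2..3
varphi2 = torch.randn(2, 128, 14, 14)   # heaviest view: channels of pathway 3
print(layer(phi).shape, layer(varphi, level=2).shape, layer(varphi2, level=3).shape)
```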

Extension-2: High-order Heterogeneous Augmentation Pathways So far we have adapted homogeneous neural pathways and loss functions to the various hyperparameters of a given heavy data augmentation in a high-order augmentation pathways network: the basic structure and settings (e.g., kernel sizes and strides of each sub-convolution) of these neural pathways are the same in APk. However, images augmented with different hyperparameters may have different characteristics, which motivates customizing the basic settings of the neural pathways for inputs with different properties. Taking GridShuffle as an example again, higher-resolution representations are more suitable for learning the detailed features inside smaller grids. This means that a neural pathway built from convolutions with larger output feature maps would be more friendly to GridShuffle with a larger $g$.

Here we introduce another high-order extension of the basic augmentation pathways that integrates representations learned from heterogeneous augmentation pathways with different characteristics. Fig. 5 shows the pipeline of a 4th-order heterogeneous augmentation pathways (HeAP4) based network with heavy augmentation in three different settings, GridShuffle$(g=2,4,7)$. Similar to the architecture of HRNet [29, 30], different neural pathways are configured with convolutions of different kernel and channel sizes, resulting in feature maps of different resolutions. The augmentation pathway in green is shared among all pathways, since the detailed visual patterns inside the grids of GridShuffle$(g=7)$ are useful for classifying all other inputs. Feature maps at the four resolutions are fed into the main pathway in a nested way during inference of the original image. We apply convolution-based downsampling to zoom feature maps out to their dependent pathways. Our heterogeneous pathway based convolutions integrate features learned from different augmentations, and each neural pathway is followed by one specific classification head. The objective function of the HeAP4 network is the same as that of the 4th-order homogeneous augmentation pathways network.
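A minimal sketch of the convolution-based downsampling mentioned above, which zooms a finer-resolution augmentation pathway out to the resolution of the pathway it feeds; the channel sizes and the fuse-by-concatenation step are assumptions for illustration, not the exact HeAP4-HRNet configuration.

```python
# Strided-convolution downsampling to merge a finer-resolution pathway into
# its coarser dependent pathway; sizes here are illustrative assumptions.
import torch
import torch.nn as nn

def pathway_downsample(in_ch, out_ch, stride=2):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

fine   = torch.randn(2, 128, 28, 28)   # e.g. the pathway fed by GridShuffle(g=7)
coarse = torch.randn(2, 256, 14, 14)   # the dependent, lower-resolution pathway
fused  = torch.cat([coarse, pathway_downsample(128, 256)(fine)], dim=1)
print(fused.shape)                     # torch.Size([2, 512, 14, 14])
```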

TABLE II: ImageNet performance, number of parameters (#Params.), and MACs of ResNet, iResNet, ResNeXt, MobileNetV2, ConvNeXt, and their basic Augmentation Pathways (AP) versions, given the additional heavy augmentation policy RandAugment (for generating $\varphi$). repro.: our reproduction of each method with its original augmentation settings.
Metrics | Method | ResNet-50 | ResNeXt-50 32x4d | MobileNetV2 | iResNet-50 | ConvNeXt-Tiny
#Params. | repro. | 25.6M | 25.0M | 3.5M | 25.6M | 28.6M
#Params. | w/ AP | 21.8M | 21.4M | 3.3M | 21.8M | 25.5M
MACs | repro. | 4.11G | 4.27G | 0.32G | 4.15G | 4.47G
MACs | w/ AP | 3.91G | 4.06G | 0.30G | 3.95G | 4.30G
Acc. (%) | repro. | 76.19 / 93.13 | 77.48 / 93.66 | 71.97 / 90.37 | 77.59 / 93.55 | 81.98 / 95.88
Acc. (%) | w/ $\varphi$ | 77.12 / 93.45 | 77.67 / 93.76 | 72.04 / 90.38 | 77.20 / 93.52 | 81.56 / 95.75
Acc. (%) | w/ AP | 77.97 / 93.92 | 78.18 / 94.07 | 72.34 / 90.48 | 78.20 / 93.95 | 82.23 / 96.01

4 ImageNet Experiments and Results

We evaluate our proposed method on the ImageNet [31] dataset (ILSVRC-2012), due to its widespread usage in supervised image recognition. Since the main purpose of data augmentation is to prevent overfitting, we also construct two smaller datasets from the training set of ImageNet by randomly sampling 100 and 20 images per class, named ImageNet100 and ImageNet20, respectively. ImageNet100 is also used for the ablation studies in this paper.

We apply augmentation pathways on six widely used backbone networks covering typical ConvNet developments from 2015 to 2022, including:

  • ResNet [14] (2015), stacking residual and non-linear blocks.

  • ResNeXt [32] (2017), repeating blocks that aggregate a set of transformations with the same topology.

  • MobileNetV2 [33] (2018), mobile architecture based on the inverted residuals and linear bottlenecks.

  • HRNet [30] (2019), exchanging information across streams with different resolutions.

  • iResNet [20] (2020), using ResGroup blocks with group convolutional layers, improved information flow and projection shortcut.

  • ConvNeXt [34] (2022), a ConvNet “modernized” toward the design of vision Transformers (e.g., Swin-T).

Single central-crop testing accuracy on the ImageNet validation set is used as the evaluation metric for all experiments.

4.1 Implementation Details

Following standard practices [14, 2, 35], we perform standard (light) data augmentation with random cropping of 224$\times$224 pixels and random horizontal flipping for all baseline methods except ConvNeXt. Following the original ConvNeXt [34] training implementation (https://github.com/facebookresearch/ConvNeXt), we adopt Mixup, CutMix, RandAugment, and Random Erasing as the light augmentation policies for ConvNeXt models. All other hyperparameters are consistent with each method’s default settings. The augmentation pathways version of each baseline is built by replacing all standard convolutional layers in the last stage [14, 35] (whose input size is $14\times14$ and output feature map size is $7\times7$) with APk-Conv. We set the input and output channel sizes of each sub-convolution $c^1, c^2, \ldots, c^k$ in APk-Conv to $1/k$ of the input and output channel sizes of the replaced standard convolutional layer, respectively. For architectures containing group convolution layers, e.g., ResNeXt, MobileNetV2 and ConvNeXt, we keep the number of groups of each convolution within every APk-Conv the same as in its corresponding original group convolution layer. For HeAP networks, we equip heterogeneous augmentation pathways after each stage. More implementation details can be found in our released source code (https://github.com/ap-conv/ap-net).

4.2 Performance Comparison

Following the settings of other heavy augmentation related works [10, 6], we first apply RandAugment with hyperparameters $m=9$, $n=2$ to generate the heavily augmented view $\varphi$. The experimental results on different network architectures are reported in Table II. Our proposed AP consistently benefits all these ConvNets with fewer model parameters and lower inference cost. Notably, the RandAugment policy searched for the ResNet-50 architecture results in a performance drop on iResNet-50 (https://github.com/iduta/iresnet), while our augmentation pathways stably improve all architectures. The performance improvement of MobileNetV2 w/ AP is not as significant as on other architectures, mainly because the limited parameters of MobileNetV2 bound its feature representation ability and restrict its capacity for visual patterns from various augmented views. Besides, since we apply an additional RandAugment on the lightly augmented view $\phi$ to generate the heavier augmented view $\varphi$ for ConvNeXt, using RandAugment twice results in performance degradation on ConvNeXt-Tiny (w/ $\varphi$); our AP can nevertheless aggregate information beneficial to the classification task from the heavier augmented view $\varphi$. These experimental results demonstrate the robustness and generality of AP.
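For reference, the two training views used here can be produced with a transform pair along the following lines, using torchvision's RandAugment with num_ops=2 and magnitude=9; the exact preprocessing and normalization values are assumptions, not the authors' released pipeline.

```python
# Sketch of generating the light view (phi) and the heavy view (varphi):
# phi feeds the main pathway, varphi feeds the augmentation pathway.
import torchvision.transforms as T

normalize = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

light_view = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    normalize,
])

heavy_view = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandAugment(num_ops=2, magnitude=9),   # searched heavy augmentation (n=2, m=9)
    T.ToTensor(),
    normalize,
])

class TwoViewTransform:
    """Return (phi, varphi) for one PIL image during training."""
    def __call__(self, img):
        return light_view(img), heavy_view(img)
```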

TABLE III: Performance comparison on ImageNet subsets. AP-ResNet achieves significant improvements with different heavy data augmentation policies.
Augmentation | Model | ImageNet100 | ImageNet20
Random Crop, Flip | ResNet | 45.01 / 70.04 | 9.59 / 23.75
GridShuffle | ResNet | 43.95 / 68.97 | 9.88 / 23.81
GridShuffle | AP-ResNet | 45.62 / 70.93 | 11.53 / 27.85
MPN | ResNet | 45.51 / 70.78 | 10.64 / 25.36
MPN | AP-ResNet | 46.98 / 71.64 | 11.14 / 26.57
Gray | ResNet | 45.83 / 71.08 | 9.63 / 24.49
Gray | AP-ResNet | 46.83 / 72.01 | 11.68 / 27.85
RandAugment | ResNet | 51.75 / 75.66 | 17.59 / 37.06
RandAugment | AP-ResNet | 53.74 / 76.83 | 20.80 / 40.86

AP on Fewer Labels We also apply augmentation pathways on the small datasets ImageNet100 and ImageNet20 to test the practical scenario of data scarcity. Besides the light augmentations, we select three manually designed heavy data augmentations, GridShuffle$(g=7)$, Gray$(\alpha=1)$, and MPN$(s=1.5)$, as well as RandAugment$(m=9, n=2)$. The experimental results are reported in Table III. AP-Net significantly boosts the performance on small datasets, which is practically useful when training data is expensive to obtain.

High-order Homogeneous Augmentation Pathways In Table IV, we compare the results of the standard ResNet-50, its basic AP version, and the 3rd-order version AP3. Specifically, our 3rd-order augmentation pathways are designed to accommodate two RandAugment settings with different hyperparameters. We find that AP3 further improves over the 2nd-order basic AP-Conv based network. The significant gains obtained by introducing more hyperparameter settings indicate that structuring the subdivision of commonalities among different feature spaces in a dependent manner benefits object recognition.

TABLE IV: Recognition accuracy of: 1) the 3rd-order augmentation pathways (AP3) based ResNet-50 equipped with the additional augmentation RandAugment2 $\big((n,m)\in\{(1,5),(2,9)\}\big)$, and 2) the heterogeneous augmentation pathways (HeAP4) based network equipped with the additional augmentation RandAugment3 $\big((n,m)\in\{(1,5),(2,9),(4,15)\}\big)$.
Method | #Params. | MACs | Augmentation | ImageNet100 | ImageNet
ResNet [14, 13] | 25.6M | 4.11G | Baseline | 45.01 / 70.04 | 76.64 / 93.24
ResNet [14, 13] | 25.6M | 4.11G | RandAugment2 | 51.67 / 75.45 | 77.03 / 93.41
AP-ResNet | 21.8M | 3.91G | RandAugment2 | 53.58 / 76.61 | 77.59 / 93.68
AP3-ResNet | 20.6M | 3.84G | RandAugment2 | 54.08 / 77.11 | 78.06 / 93.92
HRNet [30] | 67.1M | 14.93G | Baseline | 51.53 / 75.58 | 78.81 / 94.41
HRNet [30] | 67.1M | 14.93G | RandAugment3 | 53.52 / 77.54 | 77.28 / 93.95
HeAP4-HRNet | 59.9M | 13.97G | RandAugment3 | 54.35 / 78.24 | 79.25 / 94.78
TABLE V: AP-ResNet-50 w/o sharing weights, for GridShuffle$(g=7)$.
$m_t$ | $\frac{1}{2}n_t$ | $\frac{2}{3}n_t$ | $n_t$
Acc. | 45.59 ±0.13 | 45.53 ±0.11 | 43.95 ±0.11
Figure 6: The structure of the augmentation pathway based convolutional layer without feature sharing.

High-order Heterogeneous Augmentation Pathways Following the framework described in Fig. 5, we adapt an HRNet-W44-C [30] style architecture into a 4th-order heterogeneous augmentation pathways network by replacing all multi-resolution convolutions with HeAP4-Conv. Unlike HRNet, which processes one image at a time, its HeAP4 variant handles four different inputs simultaneously. The hierarchical classification head of HRNet is disabled in HeAP4, and four parallel loss functions follow the four neural pathways in HeAP4-HRNet. Only the neural pathway for lightly augmented inputs is activated during inference. Table IV summarizes the classification results of HRNet and our HeAP4-HRNet. HeAP4-HRNet significantly outperforms HRNet on ImageNet100 with fewer parameters and lower computational cost. Note that HeAP4-HRNet and HRNet are two different architectures, due to the completely different data flow, the HeAP convolutional layers, and the classification heads.

4.3 Discussions

To evaluate the statistical significance and stability of the proposed method, we report the mean and standard deviation of accuracy over five trials for all ablation experiments below, conducted on ImageNet100.

Impact of the Cross Pathways Connections We design ablation studies by removing the cross-pathway connections (i.e., no feature sharing among pathways) in AP-Conv while keeping the loss functions in Eq. (4) and Eq. (7) (as shown in Fig. 6). For a standard ConvNet, heavily augmented views directly influence the training of all parameters; for AP-Net without weight sharing, heavily augmented views only affect the training of half of the parameters (with the default $m_t=\frac{1}{2}n_t$).

The results in Table VI show that (1) our proposed loss function leads to a +0.87% improvement over the baseline, and (2) the AP-style architecture brings a further 1.18% gain, owing to the visual commonality learned among pathways.

Moreover, Table V shows that increasing the influence of heavily augmented views leads to a performance drop (a standard ConvNet is equivalent to AP-Net without weight sharing when $m_t=n_t$). This phenomenon is due to the irrelevant feature bias introduced by heavy augmentations; the divided pathway design suppresses such irrelevance.

TABLE VI: The effect of removing cross-pathway connections, and of randomly feeding inputs to different pathways. The heavy augmentation is RandAugment.
Method | ImageNet100
ResNet-50 | 51.69 ±0.09
AP-ResNet-50 w/o sharing feature | 52.58 ±0.11
AP-ResNet-50 w/ random input assignment | 52.80 ±0.14
AP-ResNet-50 | 53.76 ±0.08

Impact of Distortion Magnitudes of Augmentations The experimental results in Fig. 7 show that our AP method stably boosts the performance of the ConvNet under various RandAugment hyperparameters.

Figure 7: Top-1 accuracy (%) on ImageNet100 using RandAugment with different ($n$, $m$).

Impact of the Cross Pathways Regularization $S$ To demonstrate the effect of $S$, we perform an ablation separating the regularization term for AP-ResNet-50 with RandAugment; the results are shown in Table VII. We also compare AP-ResNet-50 under different settings of $\lambda=n\times\omega$ to evaluate AP-Net's sensitivity to the choice of $\lambda$. The cross pathways regularization benefits the feature space structure across different neural pathways, resulting in better performance, but an overly large loss weight for $S$ leads to a performance drop, behaving similarly to standard weight decay in common neural network training.

TABLE VII: The impact of the cross pathways regularization term $S$ and its weight for AP-ResNet-50 with RandAugment.
$\lambda$ | $10\omega$ | $\omega$ | $0.1\omega$ | $0.01\omega$ | 0 (w/o $S$)
Acc. | 52.86 ±0.09 | 53.14 ±0.08 | 53.76 ±0.08 | 53.45 ±0.10 | 53.19 ±0.13

Generalizing the “light vs. heavy” Augmentation Policy Setting to “basic vs. heavier” Inspired by related work [6], we define $d$ as the deviation of an augmented view from the original view. Given two augmented views $\phi$ and $\varphi$, we say $\varphi$ is heavier than $\phi$ only if $d(\varphi)>d(\phi)$. There are two situations in which $d(\varphi)>d(\phi)$ holds: 1) $\varphi$ and $\phi$ are augmented by the same policies, but $\varphi$ is augmented with more aggressive hyperparameters; 2) $\varphi$ is augmented by a set of policies that is a proper superset of the augmentations used for generating $\phi$. In AP, the basic view $\phi$ and the heavier view $\varphi$ are fed to the main and the augmentation pathway, respectively.

This means that some heavy augmentation policies may also be used to generate the basic view $\phi$; e.g., ConvNeXt applies the combination of Random Crop, Mixup, CutMix, RandAugment, and Random Erasing as basic augmentations for generating $\phi$. We can then introduce another RandAugment on $\phi$ to generate the heavier view $\varphi$ for ConvNeXt. The experimental results in Table II show that AP-ConvNeXt-Tiny with RandAugment applied twice outperforms ConvNeXt-Tiny.

Accordingly, the heavier view $\varphi$ can also be generated by applying an additional light augmentation; e.g., we can apply another crop operation on top of $\phi$ to generate the heavier view $\varphi$ (simulating an aggressive crop operation), and this still results in a performance improvement, as shown in Table VIII.

TABLE VIII: Accuracy after introducing the aggressive crop operation.
Method | Augmentation | ImageNet100
ResNet-50 | Standard Crop | 44.98 ±0.10
ResNet-50 | Aggressive Crop | 50.07 ±0.12
AP-ResNet-50 | Aggressive Crop | 52.46 ±0.09

Model Inference The augmentation pathways are designed to stabilize the training of the main pathway when heavy data augmentations are present. During inference, no heavy augmentation is adopted; only $f_{\phi}$ in the main neural pathway is used to compute the prediction probabilities for the original image.

Model Complexity Although AP usually incurs a higher memory cost during training than the standard ConvNet, many connections are cut out when replacing traditional convolutions with AP-Convs. Thus the AP version of a given standard CNN has fewer parameters (#Params.) to learn and a lower computational cost (MACs, Multiply-Accumulate Operations) during inference, as specified in Tables II-IV and Eq. (3).

5 Conclusion

The core concepts of our proposed Augmentation Pathways for stabilizing training with data augmentation can be summarized as: 1) adapting different neural pathways to inputs with different characteristics, and 2) integrating shared features by considering the visual dependencies among different inputs. Two extensions of AP are also introduced for handling data augmentations with various hyperparameters. In general, our AP based network is more efficient than a traditional CNN, with fewer parameters and lower computational cost, and it yields stable performance improvements on various datasets over a wide range of data augmentation policies.

Acknowledgments

This work was supported by the National Key R&D Program of China under Grant No. 2020AAA0103800.

References

  • [1] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning (still) requires rethinking generalization,” Communications of the ACM, vol. 64, no. 3, pp. 107–115, 2021.
  • [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [3] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [4] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning.   PMLR, 2020, pp. 1597–1607.
  • [5] A. B. Jung, K. Wada, J. Crall, S. Tanaka, J. Graving, C. Reinders, S. Yadav, J. Banerjee, G. Vecsei, A. Kraft, Z. Rui, J. Borovec, C. Vallentin, S. Zhydenko, K. Pfeiffer, B. Cook, I. Fernández, F.-M. De Rainville, C.-H. Weng, A. Ayala-Acevedo, R. Meudec, M. Laporte et al., “imgaug,” https://github.com/aleju/imgaug, 2020, online; accessed 01-Feb-2020.
  • [6] Y. Bai, Y. Yang, W. Zhang, and T. Mei, “Directional self-supervised learning for heavy image augmentations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 692–16 701.
  • [7] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [8] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” arXiv preprint arXiv:1611.05431, 2016.
  • [9] X. Li, W. Wang, X. Hu, and J. Yang, “Selective kernel networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2019, pp. 510–519.
  • [10] X. Wang and G.-J. Qi, “Contrastive learning with stronger augmentations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–12, 2022.
  • [11] Y. Chen, Y. Bai, W. Zhang, and T. Mei, “Destruction and construction learning for fine-grained image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5157–5166.
  • [12] T. DeVries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” arXiv preprint arXiv:1708.04552, 2017.
  • [13] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, “Randaugment: Practical automated data augmentation with a reduced search space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 702–703.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [15] A. Hernández-García, J. Mehrer, N. Kriegeskorte, P. König, and T. C. Kietzmann, “Deep neural networks trained with heavier data augmentation learn features closer to representations in hit,” in Conference on Cognitive Computational Neuroscience, vol. 1, 2018.
  • [16] A. Hernández-García, P. König, and T. C. Kietzmann, “Learning robust visual representations using data augmentation invariance,” arXiv preprint arXiv:1906.04547, 2019.
  • [17] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, “Autoaugment: Learning augmentation strategies from data,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2019, pp. 113–123.
  • [18] S. Lim, I. Kim, T. Kim, C. Kim, and S. Kim, “Fast autoaugment,” in Advances in Neural Information Processing Systems, 2019, pp. 6662–6672.
  • [19] R. Hataya, J. Zdenek, K. Yoshizoe, and H. Nakayama, “Faster autoaugment: Learning augmentation strategies using backpropagation,” arXiv preprint arXiv:1911.06987, 2019.
  • [20] I. C. Duta, L. Liu, F. Zhu, and L. Shao, “Improved residual networks for image and video recognition,” arXiv preprint arXiv:2004.04989, 2020.
  • [21] D. Ho, E. Liang, I. Stoica, P. Abbeel, and X. Chen, “Population based augmentation: Efficient learning of augmentation policy schedules,” in ICML, 2019.
  • [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
  • [23] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu et al., “Mmdetection: Open mmlab detection toolbox and benchmark,” arXiv preprint arXiv:1906.07155, 2019.
  • [24] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He, “Detectron,” https://github.com/facebookresearch/detectron, 2018.
  • [25] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017.
  • [26] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “Cutmix: Regularization strategy to train strong classifiers with localizable features,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6023–6032.
  • [27] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized evolution for image classifier architecture search,” in Proceedings of the aaai conference on artificial intelligence, vol. 33, 2019, pp. 4780–4789.
  • [28] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8697–8710.
  • [29] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in CVPR, 2019.
  • [30] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, W. Liu, and B. Xiao, “Deep high-resolution representation learning for visual recognition,” TPAMI, 2019.
  • [31] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition.   Ieee, 2009, pp. 248–255.
  • [32] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492–1500.
  • [33] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520.
  • [34] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11 976–11 986.
  • [35] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
Yalong Bai is a Senior Researcher at JD.com. He received his Ph.D. degree from the Harbin Institute of Technology and Microsoft Research Asia Joint Ph.D. Education Program in 2018. His research interests include representation learning, multimodal retrieval, visual question answering, and visual commonsense reasoning. He has won first place in several international challenges at CVPR, ICME, and MM. He has also served as an Area Chair for the ACM MM Challenge and ICASSP.
Mohan Zhou is currently a Ph.D. student at Harbin Institute of Technology, China, under the supervision of Prof. Tiejun Zhao. Meanwhile, he also works as a research intern at JD Explore Academy. Before that, he received his B.Eng. degree, also from Harbin Institute of Technology, in 2021. His current research interests include representation learning and multimodal learning. He achieved impressive results in several fine-grained image classification competitions organized in CVPR workshops.
Wei Zhang is now a Senior Researcher at JD.com, Beijing, China. He received his Ph.D degree from the Department of Computer Science in City University of Hong Kong. His research interests include computer vision and multimedia, especially visual recognition and generation. He has won the Best Demo Awards in ACM MM 2021, and served as the Area Chair for ICME, ICASSP, and Technical Program Chair for ACM MM Asia 2023.
Bowen Zhou (Fellow, IEEE) has been the President of Artificial Intelligence Platform & Research of JD.com since September 2017. Bowen is a technologist and business leader of human language technologies, machine learning, and artificial intelligence. Prior to joining JD.com, Dr. Zhou held several key leadership positions during his 15-year tenure at IBM Research’s headquarters. He previously served as a member of the IEEE Speech and Language Technical Committee, Associate Editor of IEEE Transactions, ICASSP Area Chair (2011-2015), ACL, and NAACL Area Chair.
Tao Mei (Fellow, IEEE) is a vice president with JD.COM and the deputy managing director of JD Explore Academy, where he also serves as the director of Computer Vision and Multimedia Lab. Prior to joining JD.COM in 2018, he was a senior research manager with Microsoft Research Asia in Beijing, China. He has authored or coauthored more than 200 publications (with 12 best paper awards) in journals and conferences, 10 book chapters, and edited five books. He holds more than 25 U.S. and international patents. He is a fellow of IAPR (2016), a distinguished scientist of ACM (2016), and a distinguished Industry Speaker of IEEE Signal Processing Society (2017).