
Augmentation Pathways Network
for Visual Recognition

Yalong Bai, Mohan Zhou, Wei Zhang, Bowen Zhou, and Tao Mei

Yalong Bai, Wei Zhang, Bowen Zhou, and Tao Mei are with JD Explore Academy, Beijing, China, 100010. E-mail: [email protected]. Mohan Zhou is with Harbin Institute of Technology.
Abstract

Data augmentation is practically helpful for visual recognition, especially at the time of data scarcity. However, such success is limited to only a few light augmentations (e.g., random crop, flip). Heavy augmentations are either unstable or show adverse effects during training, owing to the big gap between the original and augmented images. This paper introduces a novel network design, termed Augmentation Pathways (AP), to systematically stabilize training on a much wider range of augmentation policies. Notably, AP tames various heavy data augmentations and stably boosts performance without careful selection among augmentation policies. Unlike the traditional single-pathway design, augmented images are processed in different neural paths. The main pathway handles light augmentations, while other pathways focus on heavier augmentations. By interacting with multiple paths in a dependent manner, the backbone network robustly learns from shared visual patterns among augmentations while suppressing the side effects of heavy augmentations. Furthermore, we extend AP to high-order versions for handling multiple hyperparameter settings of a given augmentation, demonstrating its robustness and flexibility in practical usage. Experimental results on ImageNet demonstrate compatibility and effectiveness on a much wider range of augmentations, while consuming fewer parameters and lower computational costs at inference time.

Index Terms:
Visual Recognition, Data Augmentation, Neural Network Design, Augmentation Pathways Network.

1 Introduction

Deep convolutional neural networks (CNN) have achieved remarkable progress on visual recognition. In some cases, deep models are likely to overfit the training data as well as its noisy signals [1], even on a large-scale dataset such as ImageNet [2, 3]. Data augmentation usually serves as a standard technique for regularizing the training process and reducing the generalization error, especially when data annotations are scarce.

However, such successes in data augmentation are restricted to a handful of augmentations that only slightly jitter the original image. A large collection of augmentation operations cannot be easily applied to arbitrary configurations (e.g., datasets, backbones, hyper-parameters): sometimes data augmentation shows only marginal or even adverse effects on image classification. Following the definition in prior works (e.g., SimCLR [4], the imgaug toolkit [5], DSSL [6]), we roughly group augmentation operations into two categories (Fig. 1 left). 1) Light Augmentation, which only slightly modifies an image without significant information loss. Typical operations include random Flip and Crop [2, 7, 8, 9]. Note that the original image can also be treated as a special case of light augmentation (i.e., Identity). 2) Heavy Augmentation (also named Strong Augmentation [10]), which largely alters the image appearance, sometimes stripping out a significant amount of information (such as color or object structure). Typical operations include Gray (transforming a color image to grayscale), GridShuffle [11] (destroying object structure by shuffling image grids), and CutOut [12] (masking out a random area of the image), etc.

Figure 1: Left: Examples of original images and their lightly augmented (random Resize, Crop, Flip) and heavily augmented (Gray, GridShuffle, RandAugment) versions. Middle: Improvement in Top-1 accuracy from applying two heavy augmentations (Gray and GridShuffle) on ImageNet and its subsets (ImageNet$_n$, where $n$ indicates the number of images used per category). The standard network (ResNet-50) is quite unstable, showing marginal or adverse effects. Right: Improvement in Top-1 accuracy from applying a searched augmentation (RandAugment [13]: a collection of randomly selected heavy augmentations) on ImageNet. The augmentation policy searched for ResNet-50 leads to a performance drop on iResNet-50. In contrast, the Augmentation Pathways (AP) based network steadily benefits from a much wider range of augmentation policies for robust classification.

Based on prior studies [2, 7, 14], light augmentations have demonstrated stable performance improvements, since lightly augmented images usually share very similar visual patterns with the original ones. However, heavy augmentations inevitably introduce noisy feature patterns that follow different distributions from the original samples. Training directly with these images is thus often unstable, and sometimes shows adverse effects on performance. For example, in Fig. 1 (Middle), GridShuffle is highly unstable on ImageNet when trained with a standard network (see the ResNet column). This may be due to the implicit gap among the three sets of data: training, augmented, and test.

Intuitively, heavy augmentations also introduce helpful and complementary information during training [11]. Recent studies [15, 16] also suggest that networks trained with heavier augmentation yield representations that are closer to those observed in the human brain. However, heavy augmentation tends to generate images with larger variations from the original feature space. Such variations are not always helpful, since irrelevant feature bias is also introduced alongside the augmentation. From the opposite perspective, useful information is still implied in the visual patterns shared between the original and heavily augmented images. For example, contour information is augmented but color bias is introduced by Gray; visual details are augmented while object structure is destroyed by GridShuffle [11]. Therefore, expertise and knowledge are required to select feasible data augmentation policies [11], which is quite cumbersome in most cases. Even when augmentation improvements have been found for one specific domain or dataset, they often do not transfer well to other datasets. Some previous works employ search algorithms or adversarial learning to automatically find suitable augmentation policies [17, 18, 19, 13]. However, such methods require additional computation to obtain suitable policies, and policies searched for one setting are usually difficult to transfer to other settings. For example, in Fig. 1 (Right), RandAugment [13] searched for ResNet leads to a slight performance drop on iResNet [20] (an information-flow variant of ResNet).

In this work, we design a network architecture that handles a wide range of data augmentation policies, rather than adapting augmentation policies to specific datasets or architectures. A plug-and-play “Augmentation Pathways” (AP) design is proposed to restructure the neural paths by discriminating between augmentation policies. Specifically, a novel augmentation pathway based convolutional layer (AP-Conv) is designed to replace the standard Conv layer and stabilize training with a wide range of augmentations. As an alternative to the standard convolutional layer, AP-Conv adapts the network design to a much wider range of heavy data augmentations. As illustrated in Fig. 2, a traditional convolutional neural network directly feeds all images into the same model. In contrast, our AP-Conv (right of Fig. 2) processes lightly and heavily augmented images through different neural pathways. Precisely, a basic AP-Conv layer consists of two convolutional pathways: 1) the main pathway focuses on light augmentations, and 2) the augmentation pathway is shared between lightly and heavily augmented images to learn representations common to both for recognition. The two pathways interact with each other through shared feature channels. To further regularize the feature space, we also propose an orthogonal constraint to decouple the features learned by different pathways. Notably, our AP-Conv highlights the beneficial information shared between pathways and suppresses negative variations from heavy data augmentation. In this way, the Augmentation Pathways network naturally adapts to different data augmentation policies, including manually designed and auto-searched augmentations.

Figure 2: Illustration of a standard CNN (Left) and our proposed Augmentation Pathways network (Right) for handling data augmentations. Details of the basic AP-Conv in the purple dashed box are illustrated in Fig. 3.

Furthermore, different augmentation hyperparameters may lead to different visual appearances and classification accuracies, and tuning such hyperparameters is non-trivial. Some works propose to automatically search for a proper hyperparameter. However, these methods usually require additional computation or search cost [17], and the learned augmentation policies are dataset or network dependent [18, 21], so their generalization capability is limited. To address this, we gather all useful information from one augmentation policy under various hyperparameters, instead of selecting the single most appropriate hyperparameter as previous works did. Specifically, we extend the augmentation pathways to higher orders, so that training data generated with different hyperparameter selections of a data augmentation policy pass through different pathways. In this way, the information dependencies among different hyperparameters of data augmentation policies can be well structured, and the information from different neural pathways can be gathered into a well-structured and rich feature space.

Compared to the standard convolutional layer, our AP-Conv contains fewer connections and parameters. Moreover, it is highly compatible with standard networks: an AP-Conv based network can even be directly finetuned from a standard CNN. Experimental results on the ImageNet dataset demonstrate AP-Conv’s efficiency and effectiveness when equipped with manually designed heavy augmentations as well as searched data augmentation collections.

2 Related Work

Manually designed augmentation Since data augmentation increases training data diversity without collecting additional samples, it is a standard operation in training deep vision models and plays an essential role in improving generalization and performance. In general, light data augmentation policies, including random cropping and horizontal flipping, are commonly used in various tasks [14, 22, 23, 24]. Such augmentations keep the augmented images close to the original training distribution and lead to steady performance improvements for different neural network architectures trained on various datasets. Recently, heavy data augmentation methods have received more attention from the computer vision research community. Some methods [12, 25, 26] randomly erase image patches from the original image or replace patches with random noise. GridShuffle [11] destroys the global structure of objects in images and forces the model to learn local detail features. However, such manually designed heavy data augmentations are dataset-specific and usually do not adapt well to different datasets.

Searched augmentation Inspired by the success of neural architecture search algorithms on various computer vision tasks [27, 28], several recent studies propose search algorithms to automatically obtain augmentation policies for given datasets and network architectures. These studies try to find the best augmentation policy collection from predefined transformation functions via RL based strategies [17], population based training [21], Bayesian optimization [18], or grid search based algorithms [13]. Such methods usually take many GPU hours to search for a proper data augmentation collection before model training. Moreover, the resulting augmentation strategies are, in principle, dataset- and network-architecture-specific. These two limitations hurt the practical value of search-based data augmentation methods.

In this paper, we introduce a new viewpoint on the inter-dependency among datasets, network architectures, and data augmentation policies. Rather than selecting proper data augmentation policies for each dataset or network architecture, we propose a network architecture design that deals with various data augmentations, including not only manually designed but also searched augmentations. With lower computational cost, our method achieves stable performance improvements on various network architectures and datasets equipped with different kinds of data augmentation methods.

3 Methodology

In this section, we start with a general description of the basic augmentation pathways (AP) network (Sec. 3.1), and then introduce two extensions of AP (Sec. 3.2) for handling multiple hyperparameters of a given augmentation policy.

We focus on the deep convolutional neural network (CNN) based fully supervised image classification problem. A typical CNN architecture consists of $T$ stacked convolutional layers $\{c_1, c_2, \ldots, c_T\}$ and a classifier $f$. Given a training image $I_i$ with its category label $l_i$, $\phi_i$ denotes the lightly augmented version of $I_i$. Note that the original input image $I$ can be regarded as a special case of $\phi$. The overall objective of a typical image classification network is to minimize:

\mathcal{L}_{cls}=\sum_{i=1}^{N}\mathcal{L}\big(f(c_T(\phi_i)),\,l_i\big), \qquad (1)

where $c_t(\phi_i)=W_t\,c_{t-1}(\phi_i)+b_t$, $\mathcal{L}$ is the cross-entropy loss, and $W_t\in\mathbb{R}^{n_{t-1}\times h_t\times w_t\times n_t}$, $b_t\in\mathbb{R}^{n_t\times 1}$ are the learnable parameters of $c_t$ with kernel size $h_t\times w_t$; $n_{t-1}$ and $n_t$ are the numbers of input and output channels of $c_t$, respectively.

3.1 Augmentation Pathways (AP)

We first introduce the convolutional operation with augmentation pathways (AP-Conv), the basic unit of our proposed AP network architecture. Different from the standard convolution $c_t$ ($t=1,\ldots,T$ denoting the layer index), the AP version convolution $\mathbb{n}_t$ consists of two convolutions $c^1_t$ and $c^2_t$. $c^1_t$ forms the main pathway, learning feature representations of the lightly augmented input $\phi$ (whose distribution is similar to that of the original images). $c^2_t$ is the pathway that learns visual patterns shared between the lightly augmented image $\phi$ and the heavily augmented image $\varphi$. $\varphi$ varies with the data augmentation policy and deviates from the original image distribution. The operations of a basic AP-Conv $\mathbb{n}_t$ can be defined as:

\mathbb{n}_t(\phi_i)=c^1_t(\phi_i)\mathbin{+\!\!+}c^2_t(\phi_i)=\big(W^1_t\,\mathbb{n}_{t-1}(\phi_i)+b^1_t\big)\mathbin{+\!\!+}\big(W^2_t\,c^2_{t-1}(\phi_i)+b^2_t\big),
\mathbb{n}_t(\varphi_i)=c^2_t(\varphi_i)=W^2_t\,c^2_{t-1}(\varphi_i)+b^2_t, \qquad (2)

where $\mathbin{+\!\!+}$ denotes vector concatenation, and $W^1_t\in\mathbb{R}^{n_{t-1}\times h_t\times w_t\times(n_t-m_t)}$, $b^1_t\in\mathbb{R}^{(n_t-m_t)\times 1}$ and $W^2_t\in\mathbb{R}^{m_{t-1}\times h_t\times w_t\times m_t}$, $b^2_t\in\mathbb{R}^{m_t\times 1}$ are the convolutional weights and biases of $c^1_t$ and $c^2_t$, respectively. $m_{t-1}$ and $m_t$ denote the numbers of input and output channels of $\mathbb{n}_t$ used for processing heavily and lightly augmented inputs jointly, and are smaller than $n_{t-1}$ and $n_t$. For lightly augmented inputs, the output size of $\mathbb{n}_t$ is the same as that of $c_t$. As shown in Fig. 3, AP-Conv contains two different neural pathways within one layer, for $\phi$ and $\varphi$ respectively.
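To make the routing in Eq. (2) concrete, the sketch below shows one possible PyTorch implementation of a basic AP-Conv. The class name, the channel bookkeeping, and the convention that the shared channels sit at the end of the concatenated output are illustrative assumptions, not the released implementation.

```python
# A minimal sketch of a basic (2nd-order) AP-Conv following Eq. (2).
# Assumptions: the shared-pathway channels are the last `shared_in` channels
# of the previous layer's output; names are illustrative only.
import torch
import torch.nn as nn

class APConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size,
                 shared_in, shared_out, stride=1, padding=0):
        super().__init__()
        # c^1_t: main pathway, sees the full features of the light view only.
        self.conv1 = nn.Conv2d(in_channels, out_channels - shared_out,
                               kernel_size, stride=stride, padding=padding)
        # c^2_t: augmentation pathway, shared by the light and heavy views.
        self.conv2 = nn.Conv2d(shared_in, shared_out,
                               kernel_size, stride=stride, padding=padding)
        self.shared_in = shared_in

    def forward(self, x, heavy=False):
        if heavy:
            # Heavy view (varphi): only the shared pathway, fed m_{t-1} channels.
            return self.conv2(x)
        # Light view (phi): main pathway on all channels, shared pathway on the
        # last m_{t-1} channels, then concatenation (the ++ operator in Eq. (2)).
        out_main = self.conv1(x)
        out_shared = self.conv2(x[:, -self.shared_in:])
        return torch.cat([out_main, out_shared], dim=1)

# toy usage: 256 -> 512 channels, half of them shared (m_t = n_t / 2)
layer = APConv2d(256, 512, 3, shared_in=128, shared_out=256, padding=1)
light = torch.randn(2, 256, 14, 14)   # full feature map of the light view
heavy = torch.randn(2, 128, 14, 14)   # shared-channel features of the heavy view
print(layer(light).shape, layer(heavy, heavy=True).shape)
```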

Figure 3: The detailed structure of the basic augmentation pathway based convolutional layer.
TABLE I: Examples of data augmentations with their hyperparameters. Gray, Blur, GridShuffle, and MPN are manually designed heavy augmentations. RandAugment is a searched augmentation combination including 14 different image transformations (e.g., Shear, Equalize, Solarize, Posterize, Rotate), most of which are heavy transformations.
Augmentation | Hyperparameter | Description
Gray | the alpha value $\alpha\in[0,1]$ of the grayscale image when overlaid over the original image | $\alpha$ close to 1.0 means the new grayscale image is mostly visible
Blur | the kernel size $k$ of the blur | larger $k$ leads to a more blurred image
GridShuffle | the number of grids $g\times g$ in the image | larger $g$ results in smaller grids and more drastic destruction of the image
MPN | the scaling factor $s$ of pixel values (Multiplicative Noise) | larger $s$ results in a brighter image
RandAugment [13] | the number $n$ of augmentation transformations applied sequentially, and the magnitude $m$ of all transformations | larger $n$ and $m$ result in a more heavily augmented image

Comparison to Standard Convolution   A standard convolution can be transformed into a basic AP-Conv by splitting off an augmentation pathway and disabling a fraction of the connections. In general, the number of parameters in $\mathbb{n}_t$ is $\delta_t$ less than that of a standard convolution under the same settings, where

\delta_t=(n_{t-1}-m_{t-1})\times m_t\times h_t\times w_t. \qquad (3)

For example, if we set $m_t=\frac{1}{2}n_t$ and $m_{t-1}=\frac{1}{2}n_{t-1}$, AP-Conv contains only 75% of the parameters of the standard convolution.
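As a concrete sanity check of Eq. (3), consider an assumed $3\times3$ convolution with $n_{t-1}=n_t=512$ and the default half split $m_{t-1}=m_t=256$ (layer sizes chosen here purely for illustration):

\begin{aligned}
\text{standard Conv: }& n_{t-1}\,h_t\,w_t\,n_t=512\cdot 3\cdot 3\cdot 512=2{,}359{,}296,\\
\text{AP-Conv: }& \underbrace{512\cdot 3\cdot 3\cdot 256}_{W^1_t}+\underbrace{256\cdot 3\cdot 3\cdot 256}_{W^2_t}=1{,}769{,}472=75\%,\\
\delta_t={}&(512-256)\times 256\times 3\times 3=589{,}824=2{,}359{,}296-1{,}769{,}472.
\end{aligned}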

The only additional operation in AP-Conv is a conditional statement that routes the features of $\phi$ to both $c^1_t$ and $c^2_t$, and the features of $\varphi$ to $c^2_t$ only.

Augmentation Pathways based Network   The key idea of the basic augmentation pathways based network is to mine the visual patterns shared between the two pathways, which handle inputs following different distributions. A basic constraint is that the shared features should boost object classification, which is also the common objective of the two neural pathways:

\mathcal{L}_{cls}=\sum_{i=1}^{N}\mathcal{L}\big(f_{\phi}(\mathbb{n}_T(\phi_i)),l_i\big)+\mathcal{L}\big(f_{\varphi}(\mathbb{n}_T(\varphi_i)),l_i\big)+\lambda S_i,
S_i=\sum_{t=1}^{T}\big\langle c^1_t(\phi_i),\,c^2_t(\phi_i)\big\rangle, \qquad (4)

where $f_{\phi}$ and $f_{\varphi}$ are the classifiers for light and heavy augmentations respectively, and $S$ is a Cross Pathways Regularization term that measures the similarity of visual patterns between the two neural pathways. The formulation of $S$ is similar to standard weight decay; both are L2 regularizations. Denoting the loss weight of standard weight decay as $\omega$, we simply set $\lambda=0.1\omega$ for all experiments in this paper. Minimizing $S_i$ penalizes filter redundancy between $c^1_t$ and $c^2_t$. As a result, $c^1_t$ focuses on learning $\phi$-specific features, while, owing to the classification losses in Eq. (4), $c^2_t$ is expected to highlight patterns shared between $\phi$ and $\varphi$. Finally, these common visual patterns assist $f_{\phi}$ in classifying $\phi$ correctly. During inference, we use the label with the maximum confidence score in $f_{\phi}(\mathbb{n}_T(I_i))$ as the prediction for the image $\phi=I_i$.
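The following training-step sketch illustrates one way to implement the objective in Eq. (4). The backbone interface (returning the final light-view feature together with the per-layer $(c^1_t, c^2_t)$ activations) and all names are assumptions for illustration; in particular, the elementwise reading of the inner product assumes the default equal pathway widths $m_t=\frac{1}{2}n_t$.

```python
# Minimal sketch of one training step for the objective in Eq. (4).
# `backbone`, `f_phi`, `f_varphi` and their interfaces are assumed for
# illustration; this is not the authors' released training code.
import torch.nn.functional as F

def ap_training_step(backbone, f_phi, f_varphi, light, heavy, labels, lam=1e-5):
    feat_phi, pathway_acts = backbone(light, heavy=False, return_pathways=True)
    feat_varphi = backbone(heavy, heavy=True)

    # two classification losses, one per pathway-specific head
    loss = F.cross_entropy(f_phi(feat_phi), labels) \
         + F.cross_entropy(f_varphi(feat_varphi), labels)

    # Cross Pathways Regularization S_i: penalize redundant responses between
    # the main pathway c^1_t and the shared pathway c^2_t on the light view
    # (assumes equal pathway widths, i.e. the default m_t = n_t / 2).
    s = 0.0
    for c1_out, c2_out in pathway_acts:
        s = s + (c1_out.flatten(1) * c2_out.flatten(1)).sum(dim=1).abs().mean()
    return loss + lam * s
```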

Notably, an AP based network can be constructed by simply replacing the standard convolutional layers in a typical CNN with our AP-Conv layers, as shown in Fig. 2. In practice, low-level features can be directly shared between $\phi$ and $\varphi$; in most cases, the performance of a typical CNN can be significantly improved by replacing only the last few standard Conv layers with AP-Conv.

3.2 Extensions for Augmentation Pathways

As shown in Table I, some augmentation policies have several choices of hyperparameters. Deep models are usually sensitive to these hyperparameters, since different augmentation hyperparameters for the same image may lead to a wide variety of appearances. Previous methods tend to pick one proper hyperparameter according to expert knowledge or automatic search results.

We find that common visual patterns exist across an augmentation policy applied with different hyperparameters, and that the feature spaces shared among them usually present dependencies. For example, the shared features learned from Blur$(k=5)$ can benefit the recognition of images with Blur$(k<5)$; for GridShuffle, some detailed visual patterns learned from small grids can be reused to represent images with large grids. We therefore extend the augmentation pathways to handle an augmentation policy under various hyperparameter settings. We rank the hyperparameters of an augmentation according to the similarity of the resulting image distributions to the original training images, and then feed the images augmented with different hyperparameters into different pathways in a high-order (nested) manner. In this way, our high-order AP can gather and structure information from augmentations with various hyperparameters.

Figure 4: The 3rd-order homogeneous augmentation pathways network is extended from the basic AP but handles heavy augmentations under two different hyperparameters ($g$ for GridShuffle), according to the visual feature dependencies among input images.
Figure 5: The network architecture of our high-order heterogeneous augmentation pathways network. Four heterogeneous neural pathways (HeAP4) correspond to four different input images (lightly augmented images and GridShuffled images with $g=2,4,7$). Note that only the main neural pathway, shown in red, is activated during inference.

Extension-1: High-order Homogeneous Augmentation Pathways We extend the basic augmentation pathway into higher orders to mine shared visual patterns at different levels. Take GridShuffle as an example: we choose two different hyperparameters to generate the augmented images $\varphi=$ GridShuffle$(g=2)$ and $\varphi'=$ GridShuffle$(g=7)$. Images augmented by GridShuffle are expected to drive the network to learn visual patterns inside grids, since the positions of all grids in the image have been shuffled [11]. Considering that the grids in $\varphi'$ are smaller than those in $\phi$ and $\varphi$, the local detail features learned from $\varphi'$ can be reused for $\varphi$ and $\phi$. We propose a convolution with 3rd-order homogeneous augmentation pathways (AP3-Conv), which consists of three homogeneous convolutions $c^1_t$, $c^2_t$, and $c^3_t$ for handling different inputs. Similar to the basic AP-Conv, $c^1_t$ is the main pathway targeting light-augmentation ($\phi$-specific) features, while the augmentation pathways $c^2_t$ and $c^3_t$ are designed for learning the visual patterns shared by $\{\phi,\varphi\}$ and $\{\phi,\varphi,\varphi'\}$, respectively. The operation of AP3-Conv can be formulated as:

\mathbb{n}_t(\phi_i)=c^1_t(\phi_i)\mathbin{+\!\!+}c^2_t(\phi_i)\mathbin{+\!\!+}c^3_t(\phi_i),
\mathbb{n}_t(\varphi_i)=c^2_t(\varphi_i)\mathbin{+\!\!+}c^3_t(\varphi_i), \qquad \mathbb{n}_t(\varphi'_i)=c^3_t(\varphi'_i). \qquad (5)

In general, the pathway convolution $c^j_t(x)$ can be defined as an operation that filters information from the $j$-th through the last neural pathways,

c^j_t(x)=W^j_t\big(c^j_{t-1}(x)\mathbin{+\!\!+}c^{j+1}_{t-1}(x)\mathbin{+\!\!+}\cdots\mathbin{+\!\!+}c^k_{t-1}(x)\big)+b^j_t, \qquad (6)

where $1\le j\le k$ and $k$ is the total number of neural pathways. For AP3-Conv, we set $k=3$: $c^1_t$ takes the outputs of $c^1_{t-1}$, $c^2_{t-1}$, and $c^3_{t-1}$ as inputs, while $c^2_t$ takes the outputs of $c^2_{t-1}$ and $c^3_{t-1}$ as inputs. In this way, the dependency across $\phi$, $\varphi$, and $\varphi'$ can be built. Fig. 4 illustrates a network with 3rd-order homogeneous augmentation pathways (AP3) handling two different hyperparameters of GridShuffle, whose objective function is defined as:

\mathcal{L}_{cls}=\sum_{i=1}^{N}\mathcal{L}\big(f_{\phi}(\mathbb{n}_T(\phi_i)),l_i\big)+\mathcal{L}\big(f_{\varphi}(\mathbb{n}_T(\varphi_i)),l_i\big)+\mathcal{L}\big(f_{\varphi'}(\mathbb{n}_T(\varphi'_i)),l_i\big)+\lambda S_i,
S_i=\sum_{t=1}^{T}\big\langle c^1_t(\phi_i),c^2_t(\phi_i),c^3_t(\phi_i)\big\rangle+\big\langle c^2_t(\varphi_i),c^3_t(\varphi_i)\big\rangle. \qquad (7)

The original image $\phi=I_i$ is predicted by $f_{\phi}(\mathbb{n}_T(I_i))$ during inference.

By analogy, we can design a higher-order augmentation pathways network with $k$ homogeneous dataflow pathways for handling $k-1$ different settings of a given heavy data augmentation policy. In general, our high-order APk-Conv can handle various settings of the given augmentation and collect useful visual patterns at different levels. Finally, all features are integrated in a dependent manner, resulting in a well-structured and rich feature space for original image classification.
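The nested routing of Eqs. (5) and (6) can be sketched for an arbitrary order $k$ as follows; the class name, the even channel split, and the channel ordering are our illustrative assumptions.

```python
# Sketch of k-th order homogeneous AP-Conv routing (Eqs. (5)-(6)): pathway j
# consumes the previous-layer outputs of pathways j..k, and a view of a given
# heaviness level only passes through pathways level..k. Even channel splits
# and the class name are assumptions for illustration.
import torch
import torch.nn as nn

class APkConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, k=3, **kwargs):
        super().__init__()
        self.k = k
        in_split, out_split = in_channels // k, out_channels // k
        # pathway j (0-indexed) sees the channels produced by pathways j..k-1
        self.paths = nn.ModuleList([
            nn.Conv2d(in_split * (k - j), out_split, kernel_size, **kwargs)
            for j in range(k)
        ])

    def forward(self, x, level=1):
        # level=1: lightest view (all pathways); level=k: heaviest view (last pathway only)
        in_split = x.size(1) // (self.k - level + 1)
        outs = []
        for j in range(level - 1, self.k):
            xj = x[:, (j - level + 1) * in_split:]   # channels of pathways j..k-1
            outs.append(self.paths[j](xj))
        return torch.cat(outs, dim=1)

# toy usage for k=3 (matching AP3-Conv): channel counts shrink with heaviness
layer = APkConv2d(384, 384, 3, k=3, padding=1)
phi     = torch.randn(2, 384, 14, 14)   # light view: channels of all pathways
varphi  = torch.randn(2, 256, 14, 14)   # heavier view: channels of pathways 2..3
varphi2 = torch.randn(2, 128, 14, 14)   # heaviest view: channels of pathway 3
print(layer(phi).shape, layer(varphi, level=2).shape, layer(varphi2, level=3).shape)
```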

Extension-2: High-order Heterogeneous Augmentation Pathways So far we have adapted homogeneous neural pathways and loss functions to the various hyperparameters of a given heavy data augmentation in a high-order augmentation pathways network: the basic structure and settings (e.g., kernel sizes and strides of each sub-convolution) of these neural pathways are the same in APk. However, images augmented with different hyperparameters may have different characteristics, which motivates customizing the basic settings of the neural pathways for inputs with different properties. Taking GridShuffle as an example again, higher-resolution representations are more suitable for learning the detailed features inside smaller grids. This means that a neural pathway built from convolutions with larger output feature maps would be more friendly to GridShuffle with a larger $g$.

Here we introduce another high-order extension of the basic augmentation pathways that integrates representations learned from heterogeneous augmentation pathways with different characteristics. Fig. 5 shows the pipeline of a 4th-order heterogeneous augmentation pathways (HeAP4) based network with heavy augmentation in three different settings, GridShuffle$(g=2,4,7)$. Similar to the architecture of HRNet [29, 30], different neural pathways are configured with convolutions of different kernel and channel sizes, resulting in feature maps of different resolutions. The augmentation pathway in green is shared among all pathways, since the detailed visual patterns inside the grids of GridShuffle$(g=7)$ are useful for classifying all other inputs. Feature maps at the four resolutions are fed into the main pathway in a nested way during inference of the original image. We apply convolution-based downsampling to zoom feature maps out to their dependent pathways. Our heterogeneous pathway based convolutions integrate features learned from different augmentations, and each neural pathway is followed by one specific classification head. The objective function of the HeAP4 network is the same as that of the 4th-order homogeneous augmentation pathways network.
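A minimal sketch of the convolution-based downsampling mentioned above, which zooms a finer-resolution augmentation pathway out to the resolution of the pathway it feeds; the channel sizes and the fuse-by-concatenation step are assumptions for illustration, not the exact HeAP4-HRNet configuration.

```python
# Strided-convolution downsampling to merge a finer-resolution pathway into
# its coarser dependent pathway; sizes here are illustrative assumptions.
import torch
import torch.nn as nn

def pathway_downsample(in_ch, out_ch, stride=2):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

fine   = torch.randn(2, 128, 28, 28)   # e.g. the pathway fed by GridShuffle(g=7)
coarse = torch.randn(2, 256, 14, 14)   # the dependent, lower-resolution pathway
fused  = torch.cat([coarse, pathway_downsample(128, 256)(fine)], dim=1)
print(fused.shape)                     # torch.Size([2, 512, 14, 14])
```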

TABLE II: ImageNet performance, number of parameters (#Params.), and MACs of ResNet, iResNet, ResNeXt, MobileNetV2, ConvNeXt, and their basic Augmentation Pathways (AP) versions, given the additional heavy augmentation policy RandAugment (for generating $\varphi$). repro.: our reproduction of each method with its original augmentation settings.
Metrics | Method | ResNet-50 | ResNeXt-50 32x4d | MobileNetV2 | iResNet-50 | ConvNeXt-Tiny
#Params. | repro. | 25.6M | 25.0M | 3.5M | 25.6M | 28.6M
#Params. | w/ AP | 21.8M | 21.4M | 3.3M | 21.8M | 25.5M
MACs | repro. | 4.11G | 4.27G | 0.32G | 4.15G | 4.47G
MACs | w/ AP | 3.91G | 4.06G | 0.30G | 3.95G | 4.30G
Acc. (%) | repro. | 76.19 / 93.13 | 77.48 / 93.66 | 71.97 / 90.37 | 77.59 / 93.55 | 81.98 / 95.88
Acc. (%) | w/ $\varphi$ | 77.12 / 93.45 | 77.67 / 93.76 | 72.04 / 90.38 | 77.20 / 93.52 | 81.56 / 95.75
Acc. (%) | w/ AP | 77.97 / 93.92 | 78.18 / 94.07 | 72.34 / 90.48 | 78.20 / 93.95 | 82.23 / 96.01

4 ImageNet Experiments and Results

We evaluate our proposed method on the ImageNet [31] dataset (ILSVRC-2012), due to its widespread usage in supervised image recognition. Since the main purpose of data augmentation is to prevent overfitting, we also construct two smaller datasets from the training set of ImageNet by randomly sampling 100 and 20 images per class, named ImageNet100 and ImageNet20, respectively. ImageNet100 is also used for the ablation studies in this paper.

We apply augmentation pathways on six widely used backbone networks covering typical ConvNet developments from 2015 to 2022, including:

  • ResNet [14] (2015), stacking residual and non-linear blocks.

  • ResNeXt [32] (2017), repeating blocks that aggregate a set of transformations with the same topology.

  • MobileNetV2 [33] (2018), mobile architecture based on the inverted residuals and linear bottlenecks.

  • HRNet [30] (2019), exchanging information across streams with different resolutions.

  • iResNet [20] (2020), using ResGroup blocks with group convolutional layers, improved information flow and projection shortcut.

  • ConvNeXt [34] (2022), a ConvNet “modernized” toward the design of vision Transformers (e.g., Swin-T).

Single central-crop testing accuracy on the ImageNet validation set is used as the evaluation metric for all experiments.

4.1 Implementation Details

Following standard practices [14, 2, 35], we perform standard (light) data augmentation with random cropping of 224$\times$224 pixels and random horizontal flipping for all baseline methods except ConvNeXt. Following the original ConvNeXt [34] training implementation (https://github.com/facebookresearch/ConvNeXt), we adopt Mixup, CutMix, RandAugment, and Random Erasing as the light augmentation policies for ConvNeXt models. All other hyperparameters are consistent with each method’s default settings. The augmentation pathways version of each baseline is built by replacing all standard convolutional layers in the last stage [14, 35] (whose input size is $14\times14$ and output feature map size is $7\times7$) with APk-Conv. We set the input and output channel sizes of each sub-convolution $c^1, c^2, \ldots, c^k$ in APk-Conv to $1/k$ of the input and output channel sizes of the replaced standard convolutional layer, respectively. For architectures containing group convolution layers, e.g., ResNeXt, MobileNetV2 and ConvNeXt, we keep the number of groups of each convolution within every APk-Conv the same as in its corresponding original group convolution layer. For HeAP networks, we equip heterogeneous augmentation pathways after each stage. More implementation details can be found in our released source code (https://github.com/ap-conv/ap-net).

4.2 Performance Comparison

Following the settings of other heavy augmentation related works [10, 6], we first apply RandAugment with hyperparameters $m=9$, $n=2$ to generate the heavily augmented view $\varphi$. The experimental results on different network architectures are reported in Table II. Our proposed AP consistently benefits all these ConvNets with fewer model parameters and lower inference cost. Notably, the RandAugment policy searched for the ResNet-50 architecture results in a performance drop on iResNet-50 (https://github.com/iduta/iresnet), while our augmentation pathways stably improve all architectures. The performance improvement of MobileNetV2 w/ AP is not as significant as on other architectures, mainly because the limited parameters of MobileNetV2 bound its feature representation ability and restrict its capacity for visual patterns from various augmented views. Besides, since we apply an additional RandAugment on the lightly augmented view $\phi$ to generate the heavier augmented view $\varphi$ for ConvNeXt, using RandAugment twice results in performance degradation on ConvNeXt-Tiny (w/ $\varphi$); our AP can nevertheless aggregate information beneficial to the classification task from the heavier augmented view $\varphi$. These experimental results demonstrate the robustness and generality of AP.
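For reference, the two training views used here can be produced with a transform pair along the following lines, using torchvision's RandAugment with num_ops=2 and magnitude=9; the exact preprocessing and normalization values are assumptions, not the authors' released pipeline.

```python
# Sketch of generating the light view (phi) and the heavy view (varphi):
# phi feeds the main pathway, varphi feeds the augmentation pathway.
import torchvision.transforms as T

normalize = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

light_view = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    normalize,
])

heavy_view = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandAugment(num_ops=2, magnitude=9),   # searched heavy augmentation (n=2, m=9)
    T.ToTensor(),
    normalize,
])

class TwoViewTransform:
    """Return (phi, varphi) for one PIL image during training."""
    def __call__(self, img):
        return light_view(img), heavy_view(img)
```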

TABLE III: Performance comparison on ImageNet subsets. AP-ResNet achieves significant improvements with different heavy data augmentation policies.
Augmentation | Model | ImageNet100 | ImageNet20
Random Crop, Flip | ResNet | 45.01 / 70.04 | 9.59 / 23.75
GridShuffle | ResNet | 43.95 / 68.97 | 9.88 / 23.81
GridShuffle | AP-ResNet | 45.62 / 70.93 | 11.53 / 27.85
MPN | ResNet | 45.51 / 70.78 | 10.64 / 25.36
MPN | AP-ResNet | 46.98 / 71.64 | 11.14 / 26.57
Gray | ResNet | 45.83 / 71.08 | 9.63 / 24.49
Gray | AP-ResNet | 46.83 / 72.01 | 11.68 / 27.85
RandAugment | ResNet | 51.75 / 75.66 | 17.59 / 37.06
RandAugment | AP-ResNet | 53.74 / 76.83 | 20.80 / 40.86

AP on Fewer Labels We also apply augmentation pathways on the small datasets ImageNet100 and ImageNet20 to test the practical scenario of data scarcity. Besides the light augmentations, we select three manually designed heavy data augmentations, GridShuffle$(g=7)$, Gray$(\alpha=1)$, and MPN$(s=1.5)$, as well as RandAugment$(m=9, n=2)$. The experimental results are reported in Table III. AP-Net significantly boosts the performance on small datasets, which is practically useful when training data is expensive to obtain.

High-order Homogeneous Augmentation Pathways In Table IV, we compare the results of the standard ResNet-50, its basic AP version, and the 3rd-order version AP3. Specifically, our 3rd-order augmentation pathways are designed to accommodate two RandAugment settings with different hyperparameters. We find that AP3 further improves over the 2nd-order basic AP-Conv based network. The significant gains obtained by introducing more hyperparameter settings indicate that structuring the subdivision of commonalities among different feature spaces in a dependent manner benefits object recognition.

TABLE IV: Recognition accuracy of: 1) the 3rd-order augmentation pathways (AP3) based ResNet-50 equipped with the additional augmentation RandAugment2 $\big((n,m)\in\{(1,5),(2,9)\}\big)$, and 2) the heterogeneous augmentation pathways (HeAP4) based network equipped with the additional augmentation RandAugment3 $\big((n,m)\in\{(1,5),(2,9),(4,15)\}\big)$.
Method | #Params. | MACs | Augmentation | ImageNet100 | ImageNet
ResNet [14, 13] | 25.6M | 4.11G | Baseline | 45.01 / 70.04 | 76.64 / 93.24
ResNet [14, 13] | 25.6M | 4.11G | RandAugment2 | 51.67 / 75.45 | 77.03 / 93.41
AP-ResNet | 21.8M | 3.91G | RandAugment2 | 53.58 / 76.61 | 77.59 / 93.68
AP3-ResNet | 20.6M | 3.84G | RandAugment2 | 54.08 / 77.11 | 78.06 / 93.92
HRNet [30] | 67.1M | 14.93G | Baseline | 51.53 / 75.58 | 78.81 / 94.41
HRNet [30] | 67.1M | 14.93G | RandAugment3 | 53.52 / 77.54 | 77.28 / 93.95
HeAP4-HRNet | 59.9M | 13.97G | RandAugment3 | 54.35 / 78.24 | 79.25 / 94.78
TABLE V: AP-ResNet-50 w/o sharing weights, for GridShuffle$(g=7)$.
$m_t$ | $\frac{1}{2}n_t$ | $\frac{2}{3}n_t$ | $n_t$
Acc. | 45.59 ±0.13 | 45.53 ±0.11 | 43.95 ±0.11
Figure 6: The structure of the augmentation pathway based convolutional layer without feature sharing.

High-order Heterogeneous Augmentation Pathways Following the framework described in Fig. 5, we adapt an HRNet-W44-C [30] style architecture into a 4th-order heterogeneous augmentation pathways network by replacing all multi-resolution convolutions with HeAP4-Conv. Unlike HRNet, which processes one image at a time, its HeAP4 variant handles four different inputs simultaneously. The hierarchical classification head of HRNet is disabled in HeAP4, and four parallel loss functions follow the four neural pathways in HeAP4-HRNet. Only the neural pathway for lightly augmented inputs is activated during inference. Table IV summarizes the classification results of HRNet and our HeAP4-HRNet. HeAP4-HRNet significantly outperforms HRNet on ImageNet100 with fewer parameters and lower computational cost. Note that HeAP4-HRNet and HRNet are two different architectures, due to the completely different data flow, the HeAP convolutional layers, and the classification heads.

4.3 Discussions

To evaluate the statistical significance and stability of the proposed method, we report the mean and standard deviation of accuracy over five trials for all ablation experiments below, conducted on ImageNet100.

Impact of the Cross Pathways Connections We design ablation studies by removing the cross-pathway connections (i.e., no feature sharing among pathways) in AP-Conv while keeping the loss functions in Eq. (4) and Eq. (7) (as shown in Fig. 6). For a standard ConvNet, heavily augmented views directly influence the training of all parameters; for AP-Net without weight sharing, heavily augmented views only affect the training of half of the parameters (with the default $m_t=\frac{1}{2}n_t$).

The results in Table VI show that (1) our proposed loss function leads to a +0.87% improvement over the baseline, and (2) the AP-style architecture brings a further 1.18% gain, owing to the visual commonality learned among pathways.

Moreover, Table V shows that increasing the influence of heavily augmented views leads to a performance drop (a standard ConvNet is equivalent to AP-Net without weight sharing when $m_t=n_t$). This phenomenon is due to the irrelevant feature bias introduced by heavy augmentations; the divided pathway design suppresses such irrelevance.

TABLE VI: The effect of removing cross-pathway connections, and of randomly feeding inputs to different pathways. The heavy augmentation is RandAugment.
Method | ImageNet100
ResNet-50 | 51.69 ±0.09
AP-ResNet-50 w/o sharing feature | 52.58 ±0.11
AP-ResNet-50 w/ random input assignment | 52.80 ±0.14
AP-ResNet-50 | 53.76 ±0.08

Impact of Distortion Magnitudes of Augmentations The experimental results in Fig. 7 show that our AP method stably boosts the performance of the ConvNet under various RandAugment hyperparameters.

Figure 7: Top-1 accuracy (%) on ImageNet100 using RandAugment with different ($n$, $m$).

Impact of the Cross Pathways Regularization $S$ To demonstrate the effect of $S$, we perform an ablation separating the regularization term for AP-ResNet-50 with RandAugment; the results are shown in Table VII. We also compare AP-ResNet-50 under different settings of $\lambda=n\times\omega$ to evaluate AP-Net's sensitivity to the choice of $\lambda$. The cross pathways regularization benefits the feature space structure across different neural pathways, resulting in better performance, but an overly large loss weight for $S$ leads to a performance drop, behaving similarly to standard weight decay in common neural network training.

TABLE VII: The impact of the cross pathways regularization term $S$ and its weight for AP-ResNet-50 with RandAugment.
$\lambda$ | $10\omega$ | $\omega$ | $0.1\omega$ | $0.01\omega$ | 0 (w/o $S$)
Acc. | 52.86 ±0.09 | 53.14 ±0.08 | 53.76 ±0.08 | 53.45 ±0.10 | 53.19 ±0.13

Generalizing the “light vs. heavy” Augmentation Policy Setting to “basic vs. heavier” Inspired by related work [6], we define $d$ as the deviation of an augmented view from the original view. Given two augmented views $\phi$ and $\varphi$, we say $\varphi$ is heavier than $\phi$ only if $d(\varphi)>d(\phi)$. There are two situations in which $d(\varphi)>d(\phi)$ holds: 1) $\varphi$ and $\phi$ are augmented by the same policies, but $\varphi$ is augmented with more aggressive hyperparameters; 2) $\varphi$ is augmented by a set of policies that is a proper superset of the augmentations used for generating $\phi$. In AP, the basic view $\phi$ and the heavier view $\varphi$ are fed to the main and the augmentation pathway, respectively.

This means that some heavy augmentation policies may also be used to generate the basic view $\phi$; e.g., ConvNeXt applies the combination of Random Crop, Mixup, CutMix, RandAugment, and Random Erasing as basic augmentations for generating $\phi$. We can then introduce another RandAugment on $\phi$ to generate the heavier view $\varphi$ for ConvNeXt. The experimental results in Table II show that AP-ConvNeXt-Tiny with RandAugment applied twice outperforms ConvNeXt-Tiny.

Accordingly, the heavier view $\varphi$ can also be generated by applying an additional light augmentation; e.g., we can apply another crop operation on top of $\phi$ to generate the heavier view $\varphi$ (simulating an aggressive crop operation), and this still results in a performance improvement, as shown in Table VIII.

TABLE VIII: Accuracy after introducing the aggressive crop operation.
Method | Augmentation | ImageNet100
ResNet-50 | Standard Crop | 44.98 ±0.10
ResNet-50 | Aggressive Crop | 50.07 ±0.12
AP-ResNet-50 | Aggressive Crop | 52.46 ±0.09

Model Inference The augmentation pathways are designed to stabilize the training of the main pathway when heavy data augmentations are present. During inference, no heavy augmentation is adopted; only $f_{\phi}$ in the main neural pathway is used to compute the prediction probabilities for the original image.

Model Complexity Although AP usually incurs a higher memory cost during training than the standard ConvNet, many connections are cut out when replacing traditional convolutions with AP-Convs. Thus the AP version of a given standard CNN has fewer parameters (#Params.) to learn and a lower computational cost (MACs, Multiply-Accumulate Operations) during inference, as specified in Tables II-IV and Eq. (3).

5 Conclusion

The core concepts of our proposed Augmentation Pathways for stabilizing training with data augmentation can be summarized as: 1) adapting different neural pathways to inputs with different characteristics, and 2) integrating shared features by considering the visual dependencies among different inputs. Two extensions of AP are also introduced for handling data augmentations with various hyperparameters. In general, our AP based network is more efficient than a traditional CNN, with fewer parameters and lower computational cost, and it yields stable performance improvements on various datasets over a wide range of data augmentation policies.

Acknowledgments

This work was supported by the National Key R&D Program of China under Grant No. 2020AAA0103800.

References

  • [1] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning (still) requires rethinking generalization,” Communications of the ACM, vol. 64, no. 3, pp. 107–115, 2021.
  • [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [3] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [4] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning.   PMLR, 2020, pp. 1597–1607.
  • [5] A. B. Jung, K. Wada, J. Crall, S. Tanaka, J. Graving, C. Reinders, S. Yadav, J. Banerjee, G. Vecsei, A. Kraft, Z. Rui, J. Borovec, C. Vallentin, S. Zhydenko, K. Pfeiffer, B. Cook, I. Fernández, F.-M. De Rainville, C.-H. Weng, A. Ayala-Acevedo, R. Meudec, M. Laporte et al., “imgaug,” https://github.com/aleju/imgaug, 2020, online; accessed 01-Feb-2020.
  • [6] Y. Bai, Y. Yang, W. Zhang, and T. Mei, “Directional self-supervised learning for heavy image augmentations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 692–16 701.
  • [7] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [8] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” arXiv preprint arXiv:1611.05431, 2016.
  • [9] X. Li, W. Wang, X. Hu, and J. Yang, “Selective kernel networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2019, pp. 510–519.
  • [10] X. Wang and G.-J. Qi, “Contrastive learning with stronger augmentations,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–12, 2022.
  • [11] Y. Chen, Y. Bai, W. Zhang, and T. Mei, “Destruction and construction learning for fine-grained image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5157–5166.
  • [12] T. DeVries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” arXiv preprint arXiv:1708.04552, 2017.
  • [13] E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le, “Randaugment: Practical automated data augmentation with a reduced search space,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 702–703.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [15] A. Hernández-García, J. Mehrer, N. Kriegeskorte, P. König, and T. C. Kietzmann, “Deep neural networks trained with heavier data augmentation learn features closer to representations in hit,” in Conference on Cognitive Computational Neuroscience, vol. 1, 2018.
  • [16] A. Hernández-García, P. König, and T. C. Kietzmann, “Learning robust visual representations using data augmentation invariance,” arXiv preprint arXiv:1906.04547, 2019.
  • [17] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, “Autoaugment: Learning augmentation strategies from data,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2019, pp. 113–123.
  • [18] S. Lim, I. Kim, T. Kim, C. Kim, and S. Kim, “Fast autoaugment,” in Advances in Neural Information Processing Systems, 2019, pp. 6662–6672.
  • [19] R. Hataya, J. Zdenek, K. Yoshizoe, and H. Nakayama, “Faster autoaugment: Learning augmentation strategies using backpropagation,” arXiv preprint arXiv:1911.06987, 2019.
  • [20] I. C. Duta, L. Liu, F. Zhu, and L. Shao, “Improved residual networks for image and video recognition,” arXiv preprint arXiv:2004.04989, 2020.
  • [21] D. Ho, E. Liang, I. Stoica, P. Abbeel, and X. Chen, “Population based augmentation: Efficient learning of augmentation policy schedules,” in ICML, 2019.
  • [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
  • [23] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu et al., “Mmdetection: Open mmlab detection toolbox and benchmark,” arXiv preprint arXiv:1906.07155, 2019.
  • [24] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He, “Detectron,” https://github.com/facebookresearch/detectron, 2018.
  • [25] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017.
  • [26] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “Cutmix: Regularization strategy to train strong classifiers with localizable features,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6023–6032.
  • [27] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized evolution for image classifier architecture search,” in Proceedings of the aaai conference on artificial intelligence, vol. 33, 2019, pp. 4780–4789.
  • [28] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8697–8710.
  • [29] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in CVPR, 2019.
  • [30] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, W. Liu, and B. Xiao, “Deep high-resolution representation learning for visual recognition,” TPAMI, 2019.
  • [31] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition.   Ieee, 2009, pp. 248–255.
  • [32] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492–1500.
  • [33] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520.
  • [34] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11 976–11 986.
  • [35] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
Yalong Bai is a Senior Researcher at JD.com. He received his Ph.D. degree from the Harbin Institute of Technology and Microsoft Research Asia Joint Ph.D. Education Program in 2018. His research interests include representation learning, multimodal retrieval, visual question answering, and visual commonsense reasoning. He has won first place in several international challenges at CVPR, ICME, and MM. He has also served as an Area Chair for the ACM MM Challenge and ICASSP.
Mohan Zhou is currently a Ph.D. student at Harbin Institute of Technology, China, under the supervision of Prof. Tiejun Zhao. Meanwhile, he also works as a research intern at JD Explore Academy. Before that, he received his B.Eng. degree, also from Harbin Institute of Technology, in 2021. His current research interests include representation learning and multimodal learning. He achieved impressive results in several fine-grained image classification competitions organized in CVPR workshops.
Wei Zhang is now a Senior Researcher at JD.com, Beijing, China. He received his Ph.D degree from the Department of Computer Science in City University of Hong Kong. His research interests include computer vision and multimedia, especially visual recognition and generation. He has won the Best Demo Awards in ACM MM 2021, and served as the Area Chair for ICME, ICASSP, and Technical Program Chair for ACM MM Asia 2023.
Bowen Zhou (Fellow, IEEE) has been the President of Artificial Intelligence Platform & Research of JD.com since September 2017. Bowen is a technologist and business leader of human language technologies, machine learning, and artificial intelligence. Prior to joining JD.com, Dr. Zhou held several key leadership positions during his 15-year tenure at IBM Research’s headquarters. He previously served as a member of the IEEE Speech and Language Technical Committee, Associate Editor of IEEE Transactions, ICASSP Area Chair (2011-2015), ACL, and NAACL Area Chair.
Tao Mei (Fellow, IEEE) is a vice president with JD.COM and the deputy managing director of JD Explore Academy, where he also serves as the director of Computer Vision and Multimedia Lab. Prior to joining JD.COM in 2018, he was a senior research manager with Microsoft Research Asia in Beijing, China. He has authored or coauthored more than 200 publications (with 12 best paper awards) in journals and conferences, 10 book chapters, and edited five books. He holds more than 25 U.S. and international patents. He is a fellow of IAPR (2016), a distinguished scientist of ACM (2016), and a distinguished Industry Speaker of IEEE Signal Processing Society (2017).