
Balanced Group Convolution: An Improved Group Convolution Based on Approximability Estimates

Youngkyu Lee Jongho Park Chang-Ock Lee
Abstract

The performance of neural networks has been significantly improved by increasing the number of channels in convolutional layers. However, this gain comes with a higher computational cost, and numerous studies have focused on reducing it. One promising approach is group convolution, which effectively reduces the computational cost by grouping channels. However, to the best of our knowledge, there has been no theoretical analysis of how well group convolution approximates standard convolution. In this paper, we mathematically analyze the approximation of the group convolution to the standard convolution with respect to the number of groups. Furthermore, we propose a novel variant of group convolution called balanced group convolution, which achieves better approximability at a small additional computational cost. We provide experimental results that validate our theoretical findings and demonstrate the superior performance of the balanced group convolution over other variants of group convolution.

keywords:
Convolutional layer, Group convolution, Approximability estimate
journal: Neural Networks
[a] Natural Science Research Institute, KAIST, Daejeon 34141, Korea
[b] Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
[c] Department of Mathematical Sciences, KAIST, Daejeon 34141, Korea

1 Introduction

The convolutional layer plays a crucial role in the success of modern neural networks for image classification (He et al., 2016; Hu et al., 2018; Krizhevsky et al., 2017) and image processing problems (Radford et al., 2015; Ronneberger et al., 2015; Zhang et al., 2017). However, achieving high performance through convolutional neural networks (CNNs) typically requires the use of a large number of channels (Brock et al., 2021; He et al., 2022; Huang et al., 2019; Tan and Le, 2019; Zagoruyko and Komodakis, 2016), resulting in significant computational costs and long training times. Accordingly, there have been many studies focusing on modifying the convolutional layer to reduce its computational complexity (Gholami et al., 2022; Krizhevsky et al., 2017; Lee et al., 2022; Liu et al., 2020; Zhang et al., 2015a, b).

Among them, group convolution is the most basic such modification. It was introduced in AlexNet (Krizhevsky et al., 2017) as a way to distribute the computation of convolutional layers and resolve memory shortages. Group convolution divides the channels of each layer into groups and performs convolution only within each group. CNNs with group convolution succeed in reducing the number of parameters and the computation time, but their performance drops rapidly as the number of groups increases (Lee et al., 2022; Long et al., 2015). It has been speculated that this degradation arises because the lack of intergroup communication greatly reduces the representation capacity of CNNs. However, it has not yet been shown mathematically how much the performance is degraded.

Recently, several modifications of group convolution that add intergroup communication have been considered to restore the performance (Chollet, 2017; Lee et al., 2022; Long et al., 2015; Zhang et al., 2018). The channel shuffling structure used in ShuffleNet (Zhang et al., 2018) adds a permutation step among the output channels of the group convolution to enable data exchange among groups. This method is efficient in terms of memory usage because it does not use any additional parameters. Learnable group convolution (Huang et al., 2018) determines the channels to be grouped through learning. This method generates the weight for the group convolution by overlaying a trainable mask on the weight of the standard convolution; in effect, it changes the channel shuffling from a fixed rule to a learnable one. Although these methods offer no benefit when viewed as a single layer, they effectively restore the performance of CNNs that stack multiple layers. On the other hand, fully learnable group convolution (Wang et al., 2019) introduces more parameters to additionally vary the number of channels in each group. In (Lee et al., 2022), it was observed that group convolution has a block-diagonal matrix structure, and two-level group convolution was introduced, which collects representatives of each group and performs an additional convolution at low computational cost. These works have successfully resolved the performance degradation issue described above, but there is still no mathematical analysis of why the performance is improved.

In this paper, we rigorously analyze the performance of the group convolution as an approximation to the corresponding standard convolution with respect to the number of groups N. To achieve this, we introduce a new measure of approximability (see (3.2)), which quantifies the optimal squared $\ell^{2}$-error between the outputs of the standard convolution and the group convolution. We prove that the approximability of the group convolution is estimated as $K(1-1/N)^{2}$, where K denotes the number of parameters in a convolutional layer that maps a single channel to a single channel (see Theorem 3.5).

In addition, we present a new variant of group convolution, called balanced group convolution (BGC), which achieves a theoretically improved approximability estimate compared to the plain group convolution. In BGC, the intergroup mean is computed by averaging the channels across groups. We then introduce an additional convolution that adds the intergroup mean to the output of the group convolution. This allows each group in BGC to utilize the entire information of the input, resulting in improved approximability at a small additional computational cost. Indeed, we prove that the approximability of BGC is estimated as $K(1-1/N)^{3}$ (see Theorem 3.6), which is a better result than that of the plain group convolution. Furthermore, under an additional assumption, the approximability of the group convolution and BGC can be estimated as $K(1-1/N)/n$ and $K(1-1/N)^{2}/n$, respectively, where n is the number of input channels. The superior performance of the proposed BGC is verified by various numerical experiments on several recent CNNs such as WideResNet (Zagoruyko and Komodakis, 2016), ResNeXt (Xie et al., 2017), MobileNetV2 (Sandler et al., 2018), and EfficientNet (Tan and Le, 2019).

In summary, we address the following issues in this work:

Is it possible to obtain a rigorous estimate for the approximability of group convolution? Furthermore, can we develop an enhanced version of group convolution that achieves a theoretically better approximability than the original?

We summarize our main contributions as follows:

  • We propose BGC, which is an enhanced variant of group convolution with an improved approximability estimate.

  • We estimate the bounds on approximability of group convolution and BGC.

  • We demonstrate the performance of BGC by embedding it into state-of-the-art CNNs.

The rest of this paper is organized as follows. In Section 2, we present preliminaries of this paper, specifically related to the group convolution. We introduce the proposed BGC and its improved approximation properties in Section 3, and establish connections to existing group convolution approaches in Section 4. Numerical validations of the theoretical results and improved classification accuracy of the proposed BGC applied to various state-of-the-art CNNs are presented in Section 5. We conclude this paper with remarks in Section 6.

2 Group convolution

In this section, we introduce some basic notation for group convolution to be used throughout this paper. Let m and n be positive integers and N be a common divisor of m and n. We write $m_{N}=m/N$ and $n_{N}=n/N$.

We consider convolutional layers that operate on n channels and produce m channels. The input $\mathbf{x}$ of a convolutional layer is written as $\mathbf{x}=(x_{1},x_{2},\dots,x_{n})$, where $x_{j}$ is the jth channel of $\mathbf{x}$. Similarly, the output $\mathbf{y}$ is written as $\mathbf{y}=(y_{1},y_{2},\dots,y_{m})$, where $y_{i}$ is the ith channel of $\mathbf{y}$. To represent a standard convolutional layer that takes input $\mathbf{x}$ and produces output $\mathbf{y}$, we use the following matrix expression:

\begin{bmatrix}y_{1}\\ y_{2}\\ \vdots\\ y_{m}\end{bmatrix}=\begin{bmatrix}W_{11}&W_{12}&\cdots&W_{1n}\\ W_{21}&W_{22}&\cdots&W_{2n}\\ \vdots&\vdots&\ddots&\vdots\\ W_{m1}&W_{m2}&\cdots&W_{mn}\end{bmatrix}\begin{bmatrix}x_{1}\\ x_{2}\\ \vdots\\ x_{n}\end{bmatrix}, (2.1)

where $W_{ij}$ is a matrix that represents the generic convolution that maps $x_{j}$ to $y_{i}$. We suppose that the channels of $\mathbf{x}$ and $\mathbf{y}$ are evenly partitioned into N groups. Namely, let $\mathbf{x}=(\mathbf{x}^{1},\mathbf{x}^{2},\dots,\mathbf{x}^{N})$ and $\mathbf{y}=(\mathbf{y}^{1},\dots,\mathbf{y}^{N})$, where

\mathbf{x}^{k}=\begin{bmatrix}x_{(k-1)n_{N}+1}\\ x_{(k-1)n_{N}+2}\\ \vdots\\ x_{kn_{N}}\end{bmatrix},\quad\mathbf{y}^{k}=\begin{bmatrix}y_{(k-1)m_{N}+1}\\ y_{(k-1)m_{N}+2}\\ \vdots\\ y_{km_{N}}\end{bmatrix}.

Then (2.1) is rewritten as the following block form:

\begin{bmatrix}\mathbf{y}^{1}\\ \mathbf{y}^{2}\\ \vdots\\ \mathbf{y}^{N}\end{bmatrix}=\begin{bmatrix}\mathbf{W}^{11}&\mathbf{W}^{12}&\cdots&\mathbf{W}^{1N}\\ \mathbf{W}^{21}&\mathbf{W}^{22}&\cdots&\mathbf{W}^{2N}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{W}^{N1}&\mathbf{W}^{N2}&\cdots&\mathbf{W}^{NN}\end{bmatrix}\begin{bmatrix}\mathbf{x}^{1}\\ \mathbf{x}^{2}\\ \vdots\\ \mathbf{x}^{N}\end{bmatrix}. (2.2)

Group convolution, a computationally efficient version of the convolution proposed in (Krizhevsky et al., 2017), is formed by discarding the intergroup connections in (2.2). That is, we take only the block-diagonal part to obtain the group convolution:

\begin{bmatrix}\mathbf{y}^{1}\\ \mathbf{y}^{2}\\ \vdots\\ \mathbf{y}^{N}\end{bmatrix}=\begin{bmatrix}\mathbf{W}^{11}&\mathbf{O}&\cdots&\mathbf{O}\\ \mathbf{O}&\mathbf{W}^{22}&\cdots&\mathbf{O}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{O}&\mathbf{O}&\cdots&\mathbf{W}^{NN}\end{bmatrix}\begin{bmatrix}\mathbf{x}^{1}\\ \mathbf{x}^{2}\\ \vdots\\ \mathbf{x}^{N}\end{bmatrix}, (2.3)

where $\mathbf{O}$ denotes the zero matrix. We observe that the number of parameters in the group convolution (2.3) is $1/N$ of that of the standard convolution (2.2). Moreover, the representation matrix in (2.3) is block-diagonal, allowing each block on the main diagonal to be processed in parallel, making it computationally more efficient than (2.2). However, when the number of parameters in the group convolution is reduced, the performance of a CNN using group convolutional layers may decrease compared to a CNN using standard convolutional layers (Lee et al., 2022; Zhang et al., 2018).
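In PyTorch-like frameworks, the block-diagonal structure (2.3) corresponds to the groups argument of a convolution module. The following minimal sketch, with illustrative channel counts of our choosing, shows the $1/N$ parameter reduction.

```python
import torch.nn as nn

n, m, N = 64, 128, 4  # illustrative numbers of input channels, output channels, and groups

standard = nn.Conv2d(n, m, kernel_size=3, padding=1, bias=False)           # (2.2)
grouped = nn.Conv2d(n, m, kernel_size=3, padding=1, groups=N, bias=False)  # (2.3)

# The grouped layer holds 1/N of the parameters of the standard layer (K = 9 here).
print(sum(p.numel() for p in standard.parameters()))  # 9 * 128 * 64     = 73728
print(sum(p.numel() for p in grouped.parameters()))   # 9 * 128 * 64 / 4 = 18432
```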

3 Main results

In this section, we present the main results of this paper, including the proposed BGC and estimates of its approximability and computational cost.

3.1 Balanced group convolution

A major drawback of the group convolution, which affects performance, is the lack of intergroup communication, since the representation matrix given in (2.3) is block-diagonal. The motivation behind the proposed BGC is to add intergroup communication to the group convolution at low computational cost in order to improve performance. To achieve this, we utilize a representative vector of small dimension that captures all the information in the input $\mathbf{x}$. Specifically, we use the mean $\bar{\mathbf{x}}$ of $\{\mathbf{x}^{k}\}$, i.e.,

\bar{\mathbf{x}}=\frac{1}{N}\sum_{k=1}^{N}\mathbf{x}^{k}.

Then, the proposed BGC is defined by

\begin{bmatrix}\mathbf{y}^{1}\\ \mathbf{y}^{2}\\ \vdots\\ \mathbf{y}^{N}\end{bmatrix}=\begin{bmatrix}\mathbf{W}^{11}&\mathbf{O}&\cdots&\mathbf{O}\\ \mathbf{O}&\mathbf{W}^{22}&\cdots&\mathbf{O}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{O}&\mathbf{O}&\cdots&\mathbf{W}^{NN}\end{bmatrix}\begin{bmatrix}\mathbf{x}^{1}\\ \mathbf{x}^{2}\\ \vdots\\ \mathbf{x}^{N}\end{bmatrix}+\begin{bmatrix}\bar{\mathbf{W}}^{1}\\ \bar{\mathbf{W}}^{2}\\ \vdots\\ \bar{\mathbf{W}}^{N}\end{bmatrix}\bar{\mathbf{x}}, (3.1)

where each $\bar{\mathbf{W}}^{k}$ is a convolution that operates on $n_{N}$ channels and produces $m_{N}$ channels. That is, BGC is an extension of the group convolution that incorporates an additional structure based on the intergroup mean $\bar{\mathbf{x}}$, which serves as a balancing term that feeds the entire information of $\mathbf{x}$ into the intergroup communication. Specifically, this additional structure allows BGC to better distribute feature representations across groups.

Since the number of parameters in BGC (3.1) is $2/N$ of that of the standard convolution (2.2), the computational cost of BGC is significantly lower than that of the standard convolution. We will discuss this further in Section 3.3.
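A minimal PyTorch sketch of a BGC layer as defined in (3.1) is given below; the module name BalancedGroupConv2d, the use of a 2D convolution, and the omission of bias terms are choices made for illustration.

```python
import torch
import torch.nn as nn

class BalancedGroupConv2d(nn.Module):
    """Sketch of BGC (3.1): a group convolution plus a convolution of the intergroup mean."""

    def __init__(self, in_channels, out_channels, kernel_size, groups, padding=0):
        super().__init__()
        assert in_channels % groups == 0 and out_channels % groups == 0
        self.groups = groups
        # Block-diagonal part: the plain group convolution [W^11, ..., W^NN].
        self.group_conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                                    padding=padding, groups=groups, bias=False)
        # Balancing part: maps the (n/N)-channel intergroup mean to all m output channels,
        # i.e., the stacked blocks [Wbar^1; ...; Wbar^N].
        self.balance_conv = nn.Conv2d(in_channels // groups, out_channels, kernel_size,
                                      padding=padding, bias=False)

    def forward(self, x):
        b, c, h, w = x.shape
        # Intergroup mean: average the N groups of n/N channels each.
        xbar = x.view(b, self.groups, c // self.groups, h, w).mean(dim=1)
        return self.group_conv(x) + self.balance_conv(xbar)

# Usage: a 3x3 BGC layer with 256 input/output channels and N = 4 groups.
layer = BalancedGroupConv2d(256, 256, kernel_size=3, groups=4, padding=1)
y = layer(torch.randn(8, 256, 32, 32))  # y.shape == (8, 256, 32, 32)
```

Both submodules hold $Kmn/N$ parameters, so the total matches the $2Kmn/N$ count reported in Table 1.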

3.2 Approximability estimate

Thanks to the additional structure in the proposed BGC, we may expect that it behaves more similarly to the standard convolution than the group convolution. To provide a rigorous assessment of this behavior, we introduce a notion of approximability and prove that BGC has a better approximability to the standard convolution than the group convolution.

Suppose that the standard convolution $\mathbf{W}=[\mathbf{W}^{kl}]$ that maps N groups to N groups (see (2.2)) and the input $\mathbf{x}=[\mathbf{x}^{k}]$ are drawn from a certain probability distribution. We define the approximability of the group convolution $\mathbf{W}_{\mathrm{gp}}$ of the form (2.3) as

\mathbb{E}_{\mathbf{W}}\left[\inf_{\mathbf{W}_{\mathrm{gp}}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}-\mathbf{W}_{\mathrm{gp}}\mathbf{x}\|^{2}\right]\right], (3.2)

where $\|\cdot\|$ denotes the $\ell^{2}$-norm. Namely, the approximability (3.2) measures the $\mathbf{W}$-expectation of the optimal $\mathbf{x}$-averaged squared $\ell^{2}$-error between the output $\mathbf{W}\mathbf{x}$ of the standard convolution and the output $\mathbf{W}_{\mathrm{gp}}\mathbf{x}$ of the group convolution. The approximability of BGC is defined in the same manner.

To estimate the approximability (3.2), we need the following assumptions on the distributions of $\mathbf{W}$ and $\mathbf{x}$.

Assumption 3.1.

A standard convolution $\mathbf{W}=[W_{ij}]$ and an input $\mathbf{x}=[x_{j}]$ are random variables that satisfy the following:

  (i) $\{W_{ij}\}$ are identically distributed.

  (ii) $\{x_{j}\}$ are independent and identically distributed (i.i.d.).

  (iii) $\mathbf{W}$ and $\mathbf{x}$ are independent.

Assumption 3.1(i) is essential to handle the off-diagonal parts of the group convolution when estimating (3.2). To effectively compensate for the absence of the off-diagonal parts in the group convolution using $\bar{\mathbf{x}}$, we need Assumption 3.1(ii). On the other hand, Assumption 3.1(iii) is quite natural since it asserts that the convolution and the input are independent of each other. Furthermore, if we impose an additional assumption on the parameters of the standard convolution, we can obtain better estimates on the approximability of group convolution and BGC. This assumption is specified in Assumption 3.2.

Assumption 3.2.

A standard convolution $\mathbf{W}=[W_{ij}]$ is a random variable such that $\{W_{ij}\}$ are i.i.d. with $\mathbb{E}[W_{ij}]=0$.

The independence of $W_{ij}$ is essential to obtain a sharper bound on the expectation of $\|\mathbf{W}\mathbf{x}\|^{2}$. The condition $\mathbb{E}[W_{ij}]=0$ typically holds when the parameters of the convolutional layers are generated by Glorot (Glorot and Bengio, 2010) or He (He et al., 2015) initialization, which are i.i.d. samples from a random variable with zero mean.

The following lemma presents a Young-type inequality for the standard convolution, which plays a key role in the proof of the main theorems.

Lemma 3.3.

Let $\mathbf{W}$ be a standard convolution that operates on n channels and produces m channels. For any n-channel input $\mathbf{x}$, we have

\|\mathbf{W}\mathbf{x}\|\leq K^{\frac{1}{2}}\|\mathbf{W}\|_{\mathrm{par}}\|\mathbf{x}\|,

where K is the number of parameters in a convolutional layer that maps a single channel to a single channel and $\|\mathbf{W}\|_{\mathrm{par}}$ denotes the $\ell^{2}$-norm of the vector of parameters of the convolution $\mathbf{W}$.

Proof.

We first prove the case $m=n=1$. Let W be a convolution that maps a single channel to a single channel and $x\in\mathbb{R}^{D}$ be an input. Since the output $Wx$ of the convolution is linear with respect to the parameters of W, there exists a matrix $X\in\mathbb{R}^{D\times K}$ such that

Wx=Xw,

where $w\in\mathbb{R}^{K}$ is the vector of parameters of W. Note that each entry of X is also an entry of x. For each entry $x_{d}$ ($1\leq d\leq D$) of x and each entry $w_{k}$ ($1\leq k\leq K$) of w, their product $w_{k}x_{d}$ appears at most once in the formulation of $Wx$. Hence, the kth column $v_{k}$ of X satisfies

\|v_{k}\|\leq\|x\|. (3.3)

By the triangle inequality, Cauchy–Schwarz inequality, and (3.3), we obtain

\begin{split}\|Wx\|&=\|Xw\|\leq\sum_{k=1}^{K}\|v_{k}w_{k}\|\\ &\leq\left(\sum_{k=1}^{K}\|v_{k}\|^{2}\right)^{\frac{1}{2}}\left(\sum_{k=1}^{K}w_{k}^{2}\right)^{\frac{1}{2}}\leq K^{\frac{1}{2}}\|W\|_{\mathrm{par}}\|x\|,\end{split} (3.4)

which completes the proof of the case $m=n=1$.

Next, we consider the general case. As in (2.1), we write $\mathbf{W}=[W_{ij}]$ and $\mathbf{x}=[x_{j}]$. By the triangle inequality, (3.4), and the Cauchy–Schwarz inequality, we obtain

\begin{aligned}\|\mathbf{W}\mathbf{x}\|^{2}&=\sum_{i=1}^{m}\left\|\sum_{j=1}^{n}W_{ij}x_{j}\right\|^{2}\leq\sum_{i=1}^{m}\left(\sum_{j=1}^{n}\|W_{ij}x_{j}\|\right)^{2}\\ &\leq K\sum_{i=1}^{m}\left(\sum_{j=1}^{n}\|W_{ij}\|_{\mathrm{par}}\|x_{j}\|\right)^{2}\\ &\leq K\sum_{i=1}^{m}\left(\sum_{j=1}^{n}\|W_{ij}\|_{\mathrm{par}}^{2}\cdot\sum_{j=1}^{n}\|x_{j}\|^{2}\right)\\ &=K\|\mathbf{W}\|_{\mathrm{par}}^{2}\|\mathbf{x}\|^{2},\end{aligned}

which is our desired result. ∎

A direct consequence of Lemma 3.3 is that, if $\mathbf{W}$ and $\mathbf{x}$ are independent, then we have

\mathbb{E}_{\mathbf{W}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}\|^{2}\right]\leq K\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}\|^{2}\right]. (3.5)

In the following lemma, we show that the above estimate can be improved by a multiplicative factor of $1/n$ if we additionally assume that $W_{ij}$ has zero mean and that some random variables are independent and/or identically distributed.

Lemma 3.4.

Let $\mathbf{W}=[W_{ij}]$ be a standard convolution that operates on n channels and produces m channels, and let $\mathbf{x}=[x_{j}]$ be an n-channel input. Assume that $\mathbf{W}$ and $\mathbf{x}$ satisfy the following:

  (i) $\{W_{ij}\}$ are i.i.d. with $\mathbb{E}[W_{ij}]=0$.

  (ii) $\{x_{j}\}$ are identically distributed.

  (iii) $\mathbf{W}$ and $\mathbf{x}$ are independent.

Then we have

\mathbb{E}_{\mathbf{W}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}\|^{2}\right]\leq\frac{K}{n}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}\|^{2}\right].
Proof.

We first prove the case $m=1$. For $i\neq j$, invoking (i) yields

\mathbb{E}_{\mathbf{W}}\mathbb{E}_{\mathbf{x}}\left[x_{i}^{\mathrm{T}}W_{1i}^{\mathrm{T}}W_{1j}x_{j}\right]=\mathbb{E}_{\mathbf{x}}\left[x_{i}^{\mathrm{T}}\mathbb{E}\left[W_{1i}^{\mathrm{T}}\right]\mathbb{E}\left[W_{1j}\right]x_{j}\right]=0. (3.6)

On the other hand, for $1\leq j\leq n$, by (iii) and (3.5), we have

\mathbb{E}_{W_{1j}}\mathbb{E}_{x_{j}}\left[\|W_{1j}x_{j}\|^{2}\right]\leq K\mathbb{E}\left[\|W_{1j}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|x_{j}\|^{2}\right]=\frac{K}{n^{2}}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}\|^{2}\right], (3.7)

where the last equality is due to (i) and (ii). By (3.6) and (3.7), we get

\begin{aligned}\mathbb{E}_{\mathbf{W}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}\|^{2}\right]&=\sum_{i=1}^{n}\sum_{j=1}^{n}\mathbb{E}_{\mathbf{W}}\mathbb{E}_{\mathbf{x}}\left[x_{i}^{\mathrm{T}}W_{1i}^{\mathrm{T}}W_{1j}x_{j}\right]\\ &\leq\sum_{j=1}^{n}\frac{K}{n^{2}}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}\|^{2}\right]=\frac{K}{n}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}\|^{2}\right],\end{aligned}

which completes the proof of the case $m=1$. The general case $m>1$ can be shown by applying the case $m=1$ to each ith output channel ($1\leq i\leq m$) and then summing over all i. ∎

The representation matrix of the group convolution presented in (2.3) becomes sparser as the number of groups N increases. This suggests that the approximability of the group convolution deteriorates as N increases. However, to the best of our knowledge, there has been no theoretical analysis of this. Theorem 3.5 presents the first theoretical result regarding the approximability of the group convolution.

Theorem 3.5.

For an n-channel input $\mathbf{x}$, let $\mathbf{W}\mathbf{x}$ and $\mathbf{W}_{\mathrm{gp}}\mathbf{x}$ denote the outputs of the standard and group convolutions given in (2.2) and (2.3), respectively. Under Assumption 3.1, we have

\mathbb{E}_{\mathbf{W}}\left[\inf_{\mathbf{W}_{\mathrm{gp}}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}-\mathbf{W}_{\mathrm{gp}}\mathbf{x}\|^{2}\right]\right]\leq K\left(1-\frac{1}{N}\right)^{2}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}\|^{2}\right].

In addition, if we further assume that Assumption 3.2 holds, then we have

\mathbb{E}_{\mathbf{W}}\left[\inf_{\mathbf{W}_{\mathrm{gp}}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}-\mathbf{W}_{\mathrm{gp}}\mathbf{x}\|^{2}\right]\right]\leq\frac{K}{n}\left(1-\frac{1}{N}\right)\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}\|^{2}\right].
Proof.

Take any standard convolution $\mathbf{W}=[\mathbf{W}^{kl}]$ and any input $\mathbf{x}=[\mathbf{x}^{k}]$ as in (2.2). If we set

\mathbf{W}_{\mathrm{gp}}=\begin{bmatrix}\mathbf{W}^{11}&\mathbf{O}&\cdots&\mathbf{O}\\ \mathbf{O}&\mathbf{W}^{22}&\cdots&\mathbf{O}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{O}&\mathbf{O}&\cdots&\mathbf{W}^{NN}\end{bmatrix},

i.e., if we choose $\mathbf{W}_{\mathrm{gp}}$ as the block-diagonal part of $\mathbf{W}$, then we have

\begin{aligned}\mathbb{E}_{\mathbf{W}}\left[\inf_{\mathbf{W}_{\mathrm{gp}}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}-\mathbf{W}_{\mathrm{gp}}\mathbf{x}\|^{2}\right]\right]&\leq\mathbb{E}_{\mathbf{W}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}-\mathbf{W}_{\mathrm{gp}}\mathbf{x}\|^{2}\right]\\ &=\sum_{k=1}^{N}\mathbb{E}_{\mathbf{W}}\mathbb{E}_{\mathbf{x}}\left[\left\|\sum_{l\neq k}\mathbf{W}^{kl}\mathbf{x}^{l}\right\|^{2}\right].\end{aligned} (3.8)

For each k, since the map $\mathbf{x}\mapsto\sum_{l\neq k}\mathbf{W}^{kl}\mathbf{x}^{l}$ is a convolution that operates on $(n-n/N)$ channels ($\mathbf{x}^{k}$ is excluded from the input), invoking Lemmas 3.3 and 3.4 yields

\mathbb{E}_{\mathbf{W}}\mathbb{E}_{\mathbf{x}}\left[\left\|\sum_{l\neq k}\mathbf{W}^{kl}\mathbf{x}^{l}\right\|^{2}\right]\leq C_{n,N,k}\left(\sum_{l\neq k}\mathbb{E}\left[\|\mathbf{W}^{kl}\|_{\mathrm{par}}^{2}\right]\right)\left(\sum_{l\neq k}\mathbb{E}\left[\|\mathbf{x}^{l}\|^{2}\right]\right), (3.9)

where

C_{n,N,k}=\begin{cases}K,&\text{under Assumption 3.1,}\\ \frac{K}{n-n/N},&\text{under Assumptions 3.1 and 3.2.}\end{cases} (3.10)

Note that the independence condition between $\mathbf{W}$ and $\mathbf{x}$ is used in (3.9).

Meanwhile, Assumption 3.1 implies that

\mathbb{E}\left[\|\mathbf{W}^{kl}\|_{\mathrm{par}}^{2}\right]=\frac{1}{N^{2}}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right],\quad\mathbb{E}\left[\|\mathbf{x}^{l}\|^{2}\right]=\frac{1}{N}\mathbb{E}\left[\|\mathbf{x}\|^{2}\right]. (3.11)

Finally, combining (3.8), (3.9), and (3.11) yields

\mathbb{E}_{\mathbf{W}}\left[\inf_{\mathbf{W}_{\mathrm{gp}}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}-\mathbf{W}_{\mathrm{gp}}\mathbf{x}\|^{2}\right]\right]\leq C_{n,N,k}\left(1-\frac{1}{N}\right)^{2}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}\|^{2}\right],

which completes the proof. ∎

As discussed above, BGC (3.1) has an additional structure that utilizes the intergroup mean $\bar{\mathbf{x}}$ to compensate for the absence of the off-diagonal parts in the group convolution. We obtain the main theoretical result summarized in Theorem 3.6, showing that BGC achieves an improved approximability estimate compared to the group convolution.

Theorem 3.6.

For an n-channel input $\mathbf{x}$, let $\mathbf{W}\mathbf{x}$ and $\mathbf{W}_{\mathrm{bal}}\mathbf{x}$ denote the outputs of the standard convolution and BGC given in (2.2) and (3.1), respectively. Under Assumption 3.1, we have

\mathbb{E}_{\mathbf{W}}\left[\inf_{\mathbf{W}_{\mathrm{bal}}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}-\mathbf{W}_{\mathrm{bal}}\mathbf{x}\|^{2}\right]\right]\leq K\left(1-\frac{1}{N}\right)^{3}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}\|^{2}\right].

In addition, if we further assume that Assumption 3.2 holds, we have

\mathbb{E}_{\mathbf{W}}\left[\inf_{\mathbf{W}_{\mathrm{bal}}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}-\mathbf{W}_{\mathrm{bal}}\mathbf{x}\|^{2}\right]\right]\leq\frac{K}{n}\left(1-\frac{1}{N}\right)^{2}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}\|^{2}\right].
Proof.

Take any standard convolution $\mathbf{W}=[\mathbf{W}^{kl}]$ and any input $\mathbf{x}=[\mathbf{x}^{k}]$ as in (2.2). If we choose $\mathbf{W}_{\mathrm{bal}}$ as

\mathbf{W}_{\mathrm{bal}}\mathbf{x}=\begin{bmatrix}\mathbf{W}^{11}&\mathbf{O}&\cdots&\mathbf{O}\\ \mathbf{O}&\mathbf{W}^{22}&\cdots&\mathbf{O}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{O}&\mathbf{O}&\cdots&\mathbf{W}^{NN}\end{bmatrix}\begin{bmatrix}\mathbf{x}^{1}\\ \mathbf{x}^{2}\\ \vdots\\ \mathbf{x}^{N}\end{bmatrix}+\begin{bmatrix}\sum_{l\neq 1}\mathbf{W}^{1l}\\ \sum_{l\neq 2}\mathbf{W}^{2l}\\ \vdots\\ \sum_{l\neq N}\mathbf{W}^{Nl}\end{bmatrix}\bar{\mathbf{x}},

then we get

\|\mathbf{W}\mathbf{x}-\mathbf{W}_{\mathrm{bal}}\mathbf{x}\|^{2}=\sum_{k=1}^{N}\left\|\sum_{l\neq k}\mathbf{W}^{kl}(\mathbf{x}^{l}-\bar{\mathbf{x}})\right\|^{2}.

Note that, in the proof of Theorem 3.5, the independence condition on $\{x_{j}\}$ was never used. Hence, we can use the same argument as in the proof of Theorem 3.5 to obtain

\begin{split}&\mathbb{E}_{\mathbf{W}}\left[\inf_{\mathbf{W}_{\mathrm{bal}}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}-\mathbf{W}_{\mathrm{bal}}\mathbf{x}\|^{2}\right]\right]\\ &\leq C_{n,N,k}\left(1-\frac{1}{N}\right)^{2}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}-\bar{\mathbf{x}}\|^{2}\right]\\ &=C_{n,N,k}\left(1-\frac{1}{N}\right)^{2}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\left(\mathbb{E}\left[\|\mathbf{x}\|^{2}\right]-N\mathbb{E}\left[\|\bar{\mathbf{x}}\|^{2}\right]\right),\end{split} (3.12)

where $C_{n,N,k}$ was given in (3.10). Meanwhile, since $\{\mathbf{x}^{k}\}$ are i.i.d., we have

\mathbb{E}\left[\|\bar{\mathbf{x}}\|^{2}\right]=\frac{1}{N^{2}}\mathbb{E}\left[\|\mathbf{x}\|^{2}\right]+\frac{N-1}{N}\left\|\mathbb{E}\left[\mathbf{x}^{1}\right]\right\|^{2}\geq\frac{1}{N^{2}}\mathbb{E}\left[\|\mathbf{x}\|^{2}\right]. (3.13)

Combining (3.12) and (3.13) yields the desired result. ∎

Remark 3.7.

Under Assumption 3.1, the approximability of the group convolution has a bound of $K(1-1/N)^{2}$. Adding Assumption 3.2 improves this to a $K(1-1/N)/n$ bound. Since $n\geq N$, the latter is a sharper bound. In both cases, the approximability of BGC has a sharper estimate than that of the group convolution by a factor of $(1-1/N)$.

3.3 Computational cost

We now consider the computational cost of a single convolutional layer with the proposed BGC. Let D be the size of each input channel, i.e., $x_{j}\in\mathbb{R}^{D}$. The number of scalar arithmetic operations required in a single operation of BGC is given by

2Nm_{N}n_{N}KD+nD=\frac{2KDmn}{N}+Dn. (3.14)

In the right-hand side of (3.14), the first term corresponds to the 2N blocks of convolutional layers in (3.1), and the second term corresponds to the computation of $\bar{\mathbf{x}}$. Noting that the standard convolution and the group convolution require $KDmn$ and $KDmn/N$ scalar arithmetic operations, respectively, we conclude that the computational cost of BGC is approximately $2/N$ of that of the standard convolution, but twice that of the group convolution.
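As a concrete illustration of (3.14), the following sketch evaluates these operation counts for a 3×3 layer with m = n = 1024 channels on 7×7 feature maps (so K = 9 and D = 49), the setting used later in Section 5.2.2; the counts are rough estimates that ignore bias terms and boundary effects.

```python
K, m, n, D = 9, 1024, 1024, 7 * 7   # 3x3 kernel, channels, spatial size per channel

def cost_standard():
    return K * D * m * n                    # standard convolution (2.2)

def cost_gc(N):
    return K * D * m * n // N               # group convolution (2.3)

def cost_bgc(N):
    return 2 * K * D * m * n // N + D * n   # BGC, formula (3.14)

for N in (2, 4, 8, 16):
    print(N, cost_standard(), cost_gc(N), cost_bgc(N))
```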

4 Comparison with existing works

Table 1: Overview of various variants of group convolution (GC), where n and m denote the numbers of input and output channels, respectively, N denotes the number of groups, and K represents the number of parameters of the single-channel convolutional layer.
Method | Intergroup commun. | # of channels per group | Grouping | # of parameters | Comput. cost
GC (Krizhevsky et al., 2017) | No | n/N | deterministic | Kmn/N | O(1/N)
Shuffle (Zhang et al., 2018) | Yes | n/N | deterministic | Kmn/N | O(1/N)
Learnable GC (Huang et al., 2018) | Yes | n/N | adaptive | 2Kmn | greater
Fully learnable GC (Wang et al., 2019) | Yes | varies | adaptive | Kmn+Nn+Nm | greater
Two-level GC (Lee et al., 2022) | Yes | n/N | deterministic | Kmn/N+Kn+Nm | greater
BGC | Yes | n/N | deterministic | 2Kmn/N | O(1/N)

In this section, we compare BGC with other variants of group convolution. Various methods have been proposed to improve the performance of group convolution by enabling intergroup communication: Shuffle (Zhang et al., 2018), learnable group convolution (LGC) (Huang et al., 2018), fully learnable group convolution (FLGC) (Wang et al., 2019), and two-level group convolution (TLGC) (Lee et al., 2022). In BGC, input and output channels are evenly partitioned into groups to ensure that all groups have the same computational cost, preventing computational bottlenecks. This feature is shared by GC, Shuffle, LGC, and TLGC, but BGC partitions deterministically, i.e., it uses a fixed partition that is independent of the convolution and the input. Hence, the cost of partitioning the channels into blocks is lower than that of LGC and FLGC, which compute a different partition for each convolution.

The number of parameters in BGC decreases at a rate of $O(1/N)$ as the number of groups N increases, which is consistent with the rates of GC and Shuffle. Consequently, the computational cost of BGC scales like $1/N$, similar to GC and Shuffle. This is in contrast to LGC, FLGC, and TLGC, which have additional parameters that do not scale like $O(1/N)$, making their computational costs greater than $O(1/N)$. Note that the total computational costs of LGC, FLGC, and TLGC are $DKmn/N+(N-1)(Kmn/N+mn/N+1)$, $DKmn/N+(n+m)N^{2}+Kmn/N$, and $DKmn/N+DKn+DNm$, respectively.

As discussed in Theorem 3.6, the upper bound on the approximability of BGC is $K(1-1/N)^{3}$, which is better than that of GC in Theorem 3.5 by a factor of $(1-1/N)$. To the best of our knowledge, this is the first rigorous analysis of the performance of a group convolution variant. On the other hand, there is still no approximation theory for Shuffle, LGC, FLGC, and TLGC.

Table 1 provides an overview of the comparison between BGC and the other variants of group convolution discussed above. As shown in Table 1, Shuffle (Zhang et al., 2018) shares all the advantages of BGC except for the theoretical guarantee on approximability. However, BGC has another advantage over Shuffle. While Shuffle relies on propagation through many layers to achieve intergroup communication and good accuracy, BGC incorporates intergroup communication in each layer, increasing accuracy even when the network is not deep. This is particularly advantageous when BGC is applied to CNNs with wide but not deep layers, such as WideResNet (Zagoruyko and Komodakis, 2016). This assertion is further supported by the numerical results in the next section.

5 Numerical experiments

In this section, we present numerical results that verify our theoretical results and demonstrate the performance of the proposed BGC. All programs were implemented in Python with PyTorch (Paszke et al., 2019) and all computations were performed on a cluster equipped with Intel Xeon Gold 6240R (2.4GHz, 24C) CPUs, NVIDIA RTX 3090 GPUs, and the operating system CentOS 7.

5.1 Verification of the approximability estimates

[Figure 1: four panels, (a) $S=10^{2}$, $\mathcal{N}(0,1)$; (b) $S=10^{3}$, $\mathcal{N}(0,1)$; (c) $S=10^{2}$, $\mathcal{U}(-1,1)$; (d) $S=10^{3}$, $\mathcal{U}(-1,1)$.]
Figure 1: Graphs depicting $\log\mathcal{E}$ with respect to $\log(1-1/N)$ for GC and the proposed BGC, where $\mathcal{E}$ is the error measure given in (5.1) and N is the number of groups. We draw data points from either the normal distribution $\mathcal{N}(0,1)$ or the uniform distribution $\mathcal{U}(-1,1)$. The number in the legend indicates the average slope. Note that $N=2^{k}$, $k=2,\dots,6$.
[Figure 2: four panels, (a) $S=10^{2}$, $\mathcal{N}(0,1)$; (b) $S=10^{3}$, $\mathcal{N}(0,1)$; (c) $S=10^{2}$, $\mathcal{U}(-1,1)$; (d) $S=10^{3}$, $\mathcal{U}(-1,1)$.]
Figure 2: Graphs depicting $\mathrm{Rel.}\mathcal{E}/(1-1/N)^{p}$ with respect to the number of groups N for GC ($p=2$) and the proposed BGC ($p=3$), where $\mathcal{E}$ is the error measure given in (5.1). We draw data points from either the normal distribution $\mathcal{N}(0,1)$ or the uniform distribution $\mathcal{U}(-1,1)$.

We verify the approximability estimates of GC and BGC, and in particular that the estimate for BGC is sharper than that for GC, as shown in Theorems 3.5 and 3.6. We consider a set of one-dimensional standard convolutional layers $\{\mathbf{W}^{(t)}\}_{t=1}^{S}$ with $m=n=256$ and $K=3$, generated by He initialization (He et al., 2015). We also sample S random data points $\{\mathbf{x}^{(s)}\}_{s=1}^{S}$ from either the normal distribution $\mathcal{N}(0,1)$ or the uniform distribution $\mathcal{U}(-1,1)$. To measure the approximability of GC ($\mathrm{m}=\mathrm{gp}$) and BGC ($\mathrm{m}=\mathrm{bal}$) with respect to $\mathbf{W}$, we use the following measure:

\mathcal{E}=\frac{1}{S}\sum_{t=1}^{S}\left(\min_{\mathbf{W}_{\mathrm{m}}}\frac{1}{S}\sum_{s=1}^{S}\|\mathbf{W}^{(t)}\mathbf{x}^{(s)}-\mathbf{W}_{\mathrm{m}}\mathbf{x}^{(s)}\|^{2}\right), (5.1)

which can be evaluated by conventional least squares solvers (Golub and Van Loan, 2013).
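For reference, the sketch below reproduces a scaled-down version of this experiment; instead of an exact least-squares solve, it approximates the inner minimization in (5.1) by gradient descent over a single draw of $\mathbf{W}$, and the class and function names are ours.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n = m = 64                   # smaller than the 256 channels in the paper, for a quick run
K, N, S, D = 3, 4, 200, 32   # kernel size, groups, number of samples, channel length

class BGC1d(nn.Module):
    """1D analogue of the BGC sketch in Section 3.1."""
    def __init__(self, n, m, K, N):
        super().__init__()
        self.N = N
        self.group = nn.Conv1d(n, m, K, padding=K // 2, groups=N, bias=False)
        self.balance = nn.Conv1d(n // N, m, K, padding=K // 2, bias=False)
    def forward(self, x):
        b, c, d = x.shape
        xbar = x.view(b, self.N, c // self.N, d).mean(dim=1)
        return self.group(x) + self.balance(xbar)

# Random standard convolution W (He initialization) and i.i.d. standard normal inputs x.
W = nn.Conv1d(n, m, K, padding=K // 2, bias=False)
nn.init.kaiming_normal_(W.weight)
x = torch.randn(S, n, D)
with torch.no_grad():
    y = W(x)

def approx_error(candidate, steps=2000, lr=1e-2):
    """Approximate the inner minimum in (5.1) for one draw of W by gradient descent."""
    opt = torch.optim.Adam(candidate.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((candidate(x) - y) ** 2).sum(dim=(1, 2)).mean()
        loss.backward()
        opt.step()
    return loss.item()

err_gc = approx_error(nn.Conv1d(n, m, K, padding=K // 2, groups=N, bias=False))
err_bgc = approx_error(BGC1d(n, m, K, N))
print(err_gc, err_bgc)       # BGC should attain a smaller optimal error than GC
```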

The graphs in Figure 1 depict $\log\mathcal{E}$ with respect to $\log(1-1/N)$ in various settings for GC and BGC. One can readily observe linear relations between $\log\mathcal{E}$ and $\log(1-1/N)$ for both GC and BGC. That is, we have an empirical formula

\mathcal{E}\approx C\left(1-\frac{1}{N}\right)^{\gamma}\frac{1}{S}\sum_{t=1}^{S}\|\mathbf{W}^{(t)}\|^{2}\frac{1}{S}\sum_{s=1}^{S}\|\mathbf{x}^{(s)}\|^{2} (5.2)

for some positive constants C and $\gamma$ independent of N. From the graphs, we see that $\gamma\approx 1$ for GC and $\gamma\approx 2$ for BGC, confirming our theoretical results in Theorems 3.5 and 3.6. Next, we define the relative approximability

\mathrm{Rel.}\mathcal{E}=\frac{\mathcal{E}}{\frac{1}{S}\sum_{t=1}^{S}\|\mathbf{W}^{(t)}\|^{2}\frac{1}{S}\sum_{s=1}^{S}\|\mathbf{x}^{(s)}\|^{2}}. (5.3)

In Figure 2, we plot $\mathrm{Rel.}\mathcal{E}/(1-1/N)^{p}$ with respect to N under various settings for GC and BGC, where $p=1$ for GC and $p=2$ for BGC. We observe that all values of $\mathrm{Rel.}\mathcal{E}/(1-1/N)^{p}$ are bounded by $3.83\times 10^{-3}$ independently of N, which implies that the constant C is less than $K/n\approx 11.72\times 10^{-3}$ in (5.2). This observation also agrees with the theoretical results in Theorems 3.5 and 3.6.

5.2 Embedding in recent CNNs

In Section 5.1, we verified that our approximability estimates are consistent with experimental results using synthetic data. Now, we examine the classification performance of BGC applied to various recent CNNs, including ResNet (He et al., 2016), WideResNet (Zagoruyko and Komodakis, 2016), ResNeXt (Xie et al., 2017), MobileNetV2 (Sandler et al., 2018), and EfficientNet (Tan and Le, 2019). The network structures used in our experiments are described below:

  • ResNet. In our experiments, we used ResNet-50, which was designed for ImageNet classification. It uses a bottleneck structure consisting of a 1×1 standard convolution, a group convolution with a 3×3 kernel, another 1×1 standard convolution, and a skip connection.

  • WideResNet. In our experiments, we used two different structures of WideResNet: WideResNet-28-10 and WideResNet-34-2, which were used for CIFAR-10 and ImageNet classification, respectively. It uses a residual unit consisting of two 3×3 convolutions and a skip connection.

  • ResNeXt. It uses the same structure as ResNet except that it uses a 3×3 group convolution instead of the 3×3 convolution. In our experiments, we used ResNeXt-29 (8×64d) for CIFAR-10 classification. We applied the group convolutions to the two 1×1 standard convolutions.

  • MobileNetV2. The basic structure of MobileNetV2 (Sandler et al., 2018) is an inverted residual block, similar to the bottleneck in ResNeXt. However, MobileNetV2 uses a depthwise convolution (Chollet, 2017) instead of ResNeXt's 3×3 group convolution. In our experiments, we applied group convolution to the two 1×1 standard convolutions.

  • EfficientNet. It is based on MnasNet (Tan et al., 2019), a modification of MobileNetV2. Using an automated machine learning technique (Zoph and Le, 2017), EfficientNet proposes several model structures with appropriate numbers of channels and depth. It also has several variations, from b0 to b7 models, depending on the number of parameters. In our experiments, we used the most basic model b0.

Table 2: Details of CIFAR-10 and ImageNet datasets.
Dataset Image size Classes # of training / validation samples
CIFAR-10 32×32 10 50,000 / 10,000
ImageNet 224×224 1,000 1,280,000 / 50,000

We evaluate the performance of various variants of group convolution on the CIFAR-10 (Krizhevsky, 2009) and ImageNet ILSVRC 2012 (Deng et al., 2009) datasets. Details on these datasets are given in Table 2. In addition, for the CIFAR-10 dataset, the data augmentation technique of (Lee et al., 2015) was adopted for training: four pixels are padded on each side of the images, and 32×32 random crops are sampled from the padded images and their horizontal flips. For the ImageNet dataset, input images of size 224×224 are randomly cropped from a resized image using the scale and aspect ratio augmentation of (Szegedy et al., 2015), as implemented in (Gross and Wilber, 2016).
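A minimal torchvision sketch of the CIFAR-10 augmentation described above might look as follows; any normalization step is omitted since it is not specified here.

```python
from torchvision import datasets, transforms

# Pad each side by 4 pixels, take a random 32x32 crop, and apply a random horizontal flip.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=train_transform)
```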

5.2.1 Ablation study through transfer learning

First, we apply GC and BGC to a network pre-trained on real data to see how well BGC works compared to GC. We select a ResNet-50 pre-trained on the ImageNet dataset. Note that the pre-trained parameters are available in the PyTorch library (Paszke et al., 2019). By transferring the parameters of the pre-trained standard convolutions in ResNet-50, we obtained the parameters of (2.3) and (3.1); the resulting networks are referred to as ResNet-50-GC and ResNet-50-BGC, respectively. We then further trained ResNet-50-GC and ResNet-50-BGC for 30 epochs using the SGD optimizer with batch size 128, learning rate 0.01, Nesterov momentum 0.9, and weight decay 0.0001.
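One way to perform such a transfer is sketched below, following the choices of $\mathbf{W}_{\mathrm{gp}}$ and $\mathbf{W}_{\mathrm{bal}}$ used in the proofs of Theorems 3.5 and 3.6 (keep the block-diagonal blocks and sum the off-diagonal blocks for the balancing term); whether this exact initialization was used for Table 3 is an assumption of the sketch, and the torchvision usage in the comments is likewise assumed.

```python
import torch

def transfer_to_gc_and_bgc(weight, N):
    """Split a pre-trained standard convolution weight of shape (m, n, kh, kw) into
    GC and BGC parameters, following the constructions in the proofs of Theorems 3.5 and 3.6.
    The choice Wbar^k = sum_{l != k} W^{kl} for the balancing term is an assumption here."""
    m, n, kh, kw = weight.shape
    mN, nN = m // N, n // N
    blocks = weight.view(N, mN, N, nN, kh, kw)                  # blocks[k, :, l, :] = W^{kl}
    # GC weight: block-diagonal part [W^11, ..., W^NN], shape (m, n/N, kh, kw).
    w_gc = torch.cat([blocks[k, :, k] for k in range(N)], dim=0)
    # BGC balancing weight: stack of sums over off-diagonal blocks, shape (m, n/N, kh, kw).
    w_bal = torch.cat([blocks[k].sum(dim=1) - blocks[k, :, k] for k in range(N)], dim=0)
    return w_gc, w_bal

# Hypothetical usage with a pre-trained ResNet-50 from torchvision:
# from torchvision.models import resnet50
# conv = resnet50(weights="IMAGENET1K_V1").layer1[0].conv2     # a 3x3 convolution, 64 -> 64 channels
# w_gc, w_bal = transfer_to_gc_and_bgc(conv.weight.data, N=4)
```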

The classification performance of ResNet-50-GC and ResNet-50-BGC is given in Table 3. Compared to GC, the classification error of BGC improves by between roughly 4 and 15 percentage points. Therefore, even for real data, BGC clearly recovers much of the performance lost by GC.

Table 3: Classification errors (%) of ResNet-50-GC and ResNet-50-BGC on the ImageNet dataset after transfer learning. Each network is transfer-learned from the pre-trained ResNet-50 provided in PyTorch. Note that the case of N=1 is the classification error of ResNet-50.
N ResNet-50-GC ResNet-50-BGC
1 23.87
2 26.54 22.45
4 36.70 25.51
8 43.72 31.13
16 50.09 35.45

5.2.2 Computational efficiency

To verify the computational efficiency of BGC, we conducted experiments varying the number of groups N of a 3×3 convolution mapping 1024 channels to 1024 channels with input $\mathbf{x}\in\mathbb{R}^{128\times 1024\times 7\times 7}$. Note that this convolution is used in ResNet-50. Table 4 shows the total memory usage, computation time, and classification errors for forward and backward propagation of convolutions equipped with GC and BGC, with the number of groups N varying from 2 to 16. The results are computed on a single GPU. First, looking at the total memory usage, BGC uses more memory than GC, but the gap narrows as N increases. The additional memory usage of BGC occurs when computing the mean $\bar{\mathbf{x}}$ and the convolution $\bar{\mathbf{W}}$ defined in (3.1). On the other hand, looking at the computation time, BGC is slower than GC but faster than the standard convolution when $N>2$. As can be seen from (3.14), BGC requires more computation time than GC because it performs convolution twice, but the time decreases as N increases. Moreover, we can see that the cost of calculating the mean $\bar{\mathbf{x}}$ is small enough that it has little effect on the total computation time. These results suggest that, while BGC increases memory consumption and computation time compared to GC, it can improve performance with only a small increase in computational cost when dealing with group convolutions over many channels.

Table 4: Total memory usage, computation time, and classification errors (%) for forward and backward propagation of a 3×3 convolution from 1024 channels to 1024 channels equipped with GC and BGC and the input $\mathbf{x}\in\mathbb{R}^{128\times 1024\times 7\times 7}$. Note that the case of N=1 is computed with the standard convolution.
N Total memory usage (MB) Computation time (ms) Classification error (%)
GC BGC GC BGC GC BGC
1 60.50 7.979 23.87
2 42.50 75.75 9.056 10.467 26.54 22.45
4 33.50 48.62 4.449 4.790 36.70 25.51
8 29.00 37.06 2.120 2.648 43.72 31.13
16 26.75 31.53 2.265 2.557 50.09 35.45
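A rough way to reproduce per-layer measurements like those in Table 4 is sketched below, assuming a CUDA device is available; the exact methodology used for the table may differ, and the BGC layer from the sketch in Section 3.1 can be measured in the same way.

```python
import time
import torch
import torch.nn as nn

def measure(layer, x, iters=50):
    # Rough forward+backward timing and peak memory for one layer on the GPU.
    layer, x = layer.cuda(), x.cuda().requires_grad_(True)
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        layer(x).sum().backward()
    torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) / iters * 1e3
    peak_mb = torch.cuda.max_memory_allocated() / 2**20
    return elapsed_ms, peak_mb

x = torch.randn(128, 1024, 7, 7)
for N in (1, 2, 4, 8, 16):
    gc_layer = nn.Conv2d(1024, 1024, 3, padding=1, groups=N, bias=False)
    print(N, measure(gc_layer, x))
```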

5.2.3 Comparison with existing approaches

Table 5: Classification errors (%) of EfficientNet-b0, WideResNet-28-10, and ResNeXt-29 (8×64d) on the CIFAR-10 dataset equipped with GC, Shuffle, fully learnable group convolution (FLGC), two-level group convolution (TLGC), and BGC. The case of N=1 corresponds to the standard convolution (SC). Note that EfficientNet-b0 could not increase N up to 16 due to its structure, so it was implemented only up to 8.
N Method EfficientNet-b0 WideResNet-28-10 ResNeXt-29 (8×64d)
1 SC 8.42 3.88 5.32
2 GC 10.02 4.33 5.92
Shuffle 8.72 4.00 4.52
FLGC 8.63 4.62 4.68
TLGC 8.93 4.12 5.16
BGC 7.16 3.75 5.84
4 GC 11.25 5.44 6.76
Shuffle 9.41 4.21 4.17
FLGC 9.70 6.16 5.16
TLGC 9.42 4.34 6.92
BGC 7.50 4.02 5.84
8 GC 13.14 6.43 6.52
Shuffle 10.65 4.73 4.81
FLGC 10.49 9.81 5.25
TLGC 10.68 4.79 6.56
BGC 7.80 4.22 6.00
16 GC - 8.47 6.92
Shuffle - 5.28 4.89
FLGC - 11.26 6.11
TLGC - 4.95 6.08
BGC - 4.76 5.60
Table 6: Classification errors (%) of EfficientNet-b0, WideResNet-34-2, and MobileNetV2 on the ImageNet dataset equipped with GC, Shuffle, fully learnable group convolution (FLGC), two-level group convolution (TLGC), and BGC. The case of N=1 corresponds to the standard convolution (SC).
N Method EfficientNet-b0 WideResNet-34-2 MobileNetV2
1 SC 30.68 24.99 34.13
2 GC 35.18 26.92 39.50
Shuffle 33.91 25.12 36.67
FLGC 35.53 30.53 37.98
TLGC 33.50 25.36 33.56
BGC 31.82 24.19 36.10
4 GC 41.64 31.96 46.15
Shuffle 38.02 26.84 38.14
FLGC 40.71 38.30 44.37
TLGC 36.38 27.17 36.88
BGC 33.78 25.36 37.82
8 GC 48.42 36.70 51.97
Shuffle 42.74 29.03 42.75
FLGC 45.38 45.54 49.28
TLGC 38.66 28.43 39.16
BGC 38.13 27.16 42.65

As benchmark group convolution variants, we choose GC (Krizhevsky et al., 2017), Shuffle (Zhang et al., 2018), FLGC (Wang et al., 2019), and TLGC (Lee et al., 2022), which were discussed in Section 4. All neural networks for the CIFAR-10 dataset in this section were trained using stochastic gradient descent with batch size 128, weight decay 0.0005, Nesterov momentum 0.9, 200 total epochs, and weights initialized as in (He et al., 2015). The initial learning rate was set to 0.1 and was reduced to one tenth of its value at the 60th, 120th, and 160th epochs. For ImageNet, the hyperparameter settings are the same as in the CIFAR-10 case, except for weight decay 0.0001, 90 total epochs, and the learning rate reduced by a factor of 10 at the 30th and 60th epochs.
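A minimal PyTorch sketch of the CIFAR-10 optimization setup described above follows; the placeholder model and the omitted training loop are ours.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # placeholder for any of the CNNs in Table 5

# SGD with Nesterov momentum and weight decay 5e-4 (batch size 128 is set in the data loader).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            weight_decay=5e-4, nesterov=True)
# Learning rate reduced to one tenth at the 60th, 120th, and 160th epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 120, 160], gamma=0.1)

for epoch in range(200):
    # ... one cross-entropy training pass over CIFAR-10 (omitted) ...
    scheduler.step()
```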

The classification errors of BGC and the other benchmarks applied to various CNNs on the CIFAR-10 dataset are presented in Table 5. BGC shows better overall results than the other methods. In particular, applying BGC significantly improves the classification performance of EfficientNet-b0. In addition, the classification errors of BGC for WideResNet-28-10 remain below 5% as N increases up to 16. Although Shuffle performs best for ResNeXt-29, BGC still outperforms GC. To further validate the performance of BGC, we report the classification results on the ImageNet dataset in Table 6. For this dataset, we observe that BGC outperforms the other benchmark group convolution variants for all CNN architectures except MobileNetV2. Therefore, through several experiments, we conclude that BGC is an effective and theoretically guaranteed alternative to group convolution for various CNN architectures on large-scale datasets.

6 Conclusion

In this paper, we proposed a novel variant of group convolution called BGC. We constructed BGC by combining the plain group convolution structure with a balancing term defined via the intergroup mean to improve intergroup communication. We designed a new measure (3.2) to assess the approximability of group convolution variants and proved that the approximability of group convolution is bounded by $K(1-1/N)^{2}$. We also showed that the bound on the approximability of the proposed BGC is $K(1-1/N)^{3}$, which is an improvement over the group convolution. Moreover, under an additional assumption on the parameters of the standard convolution, we showed that the bounds for the approximability of group convolution and BGC are $K(1-1/N)/n$ and $K(1-1/N)^{2}/n$, respectively. Numerical experiments with various CNNs such as WideResNet, MobileNetV2, ResNeXt, and EfficientNet have demonstrated the practical efficacy of BGC on large-scale neural networks and datasets.

We conclude this paper with a remark on BGC. A major drawback of the proposed BGC is that it requires full data communication among groups: when computing the intergroup mean $\bar{\mathbf{x}}$ that appears in the balancing term, we need the entire input $\mathbf{x}$. This high volume of communication can be a bottleneck in parallel computation, which limits the performance of the model on distributed-memory systems. We note that TLGC (Lee et al., 2022) has addressed this issue by minimizing the amount of communication required. Exploring how to improve BGC by reducing communication in a similar manner to (Lee et al., 2022), while maintaining strong performance in both theory and practice, is left as future work.

Acknowledgments

This work was supported in part by the National Research Foundation (NRF) of Korea grant funded by the Korea government (MSIT) (No. RS-2023-00208914), and in part by Basic Science Research Program through NRF funded by the Ministry of Education (No. 2019R1A6A1A10073887).

References

  • Brock et al. (2021) Brock, A., De, S., Smith, S.L., Simonyan, K., 2021. High-performance large-scale image recognition without normalization, in: Proceedings of the 38th International Conference on Machine Learning, PMLR. pp. 1059–1071.
  • Chollet (2017) Chollet, F., 2017. Xception: Deep learning with depthwise separable convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
  • Gholami et al. (2022) Gholami, A., Kim, S., Zhen, D., Yao, Z., Mahoney, M., Keutzer, K., 2022. Low-Power Computer Vision: Improve the Efficiency of Artificial Intelligence (1st ed.). Chapman and Hall/CRC, Boca Raton, FL. chapter 13. pp. 291–326.
  • Glorot and Bengio (2010) Glorot, X., Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR. pp. 249–256.
  • Golub and Van Loan (2013) Golub, G.H., Van Loan, C.F., 2013. Matrix Computations. Johns Hopkins University Press, Baltimore.
  • Gross and Wilber (2016) Gross, S., Wilber, M., 2016. Training and investigating residual nets. Facebook AI Research 6.
  • He et al. (2022) He, J., Li, L., Xu, J., 2022. Approximation properties of deep ReLU CNNs. Research in the Mathematical Sciences 9, 38.
  • He et al. (2015) He, K., Zhang, X., Ren, S., Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • He et al. (2016) He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Hu et al. (2018) Hu, J., Shen, L., Sun, G., 2018. Squeeze-and-Excitation networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Huang et al. (2018) Huang, G., Liu, S., van der Maaten, L., Weinberger, K.Q., 2018. CondenseNet: An efficient DenseNet using learned group convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Huang et al. (2019) Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., Lee, H., Ngiam, J., Le, Q.V., Wu, Y., Chen, z., 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism, in: Advances in Neural Information Processing Systems, Curran Associates, Inc.
  • Krizhevsky (2009) Krizhevsky, A., 2009. Learning multiple layers of features from tiny images. Technical Report. University of Toronto.
  • Krizhevsky et al. (2017) Krizhevsky, A., Sutskever, I., Hinton, G.E., 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM 60, 84–90.
  • Lee et al. (2015) Lee, C.Y., Xie, S., Gallagher, P., Zhang, Z., Tu, Z., 2015. Deeply-Supervised Nets, in: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, PMLR. pp. 562–570.
  • Lee et al. (2022) Lee, Y., Park, J., Lee, C.O., 2022. Two-level group convolution. Neural Networks 154, 323–332.
  • Liu et al. (2020) Liu, J., Tripathi, S., Kurup, U., Shah, M., 2020. Pruning algorithms to accelerate convolutional neural networks for edge applications: A survey. arXiv preprint arXiv:2005.04275 .
  • Long et al. (2015) Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S., 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32, 8024–8035.
  • Radford et al. (2015) Radford, A., Metz, L., Chintala, S., 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 .
  • Ronneberger et al. (2015) Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: Convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Springer, Cham. pp. 234–241.
  • Sandler et al. (2018) Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C., 2018. MobileNetV2: Inverted residuals and linear bottlenecks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Szegedy et al. (2015) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Tan et al. (2019) Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., Le, Q.V., 2019. MnasNet: Platform-aware neural architecture search for mobile, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Tan and Le (2019) Tan, M., Le, Q., 2019. EfficientNet: Rethinking model scaling for convolutional neural networks, in: Proceedings of the 36th International Conference on Machine Learning, PMLR. pp. 6105–6114.
  • Wang et al. (2019) Wang, X., Kan, M., Shan, S., Chen, X., 2019. Fully learnable group convolution for acceleration of deep neural networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Xie et al. (2017) Xie, S., Girshick, R., Dollar, P., Tu, Z., He, K., 2017. Aggregated residual transformations for deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Zagoruyko and Komodakis (2016) Zagoruyko, S., Komodakis, N., 2016. Wide residual networks, in: Proceedings of the British Machine Vision Conference (BMVC), BMVA Press. pp. 87.1–87.12.
  • Zhang et al. (2017) Zhang, K., Zuo, W., Chen, Y., Meng, D., Zhang, L., 2017. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing 26, 3142–3155.
  • Zhang et al. (2018) Zhang, X., Zhou, X., Lin, M., Sun, J., 2018. Shufflenet: An extremely efficient convolutional neural network for mobile devices, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Zhang et al. (2015a) Zhang, X., Zou, J., He, K., Sun, J., 2015a. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 1943–1955.
  • Zhang et al. (2015b) Zhang, X., Zou, J., Ming, X., He, K., Sun, J., 2015b. Efficient and accurate approximations of nonlinear convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Zoph and Le (2017) Zoph, B., Le, Q., 2017. Neural architecture search with reinforcement learning, in: International Conference on Learning Representations.