
Balanced Group Convolution: An Improved Group Convolution Based on Approximability Estimates

Youngkyu Lee Jongho Park Chang-Ock Lee
Abstract

The performance of neural networks has been significantly improved by increasing the number of channels in convolutional layers. However, this gain comes with a higher computational cost, and numerous studies have focused on reducing it. One promising approach is group convolution, which effectively reduces the computational cost by grouping channels. However, to the best of our knowledge, there has been no theoretical analysis of how well group convolution approximates standard convolution. In this paper, we mathematically analyze the approximation of the group convolution to the standard convolution with respect to the number of groups. Furthermore, we propose a novel variant of group convolution called balanced group convolution, which achieves better approximability at a small additional computational cost. We provide experimental results that validate our theoretical findings and demonstrate the superior performance of the balanced group convolution over other variants of group convolution.

keywords:
Convolutional layer, Group convolution, Approximability estimate
journal: Neural Networks
[a] Natural Science Research Institute, KAIST, Daejeon 34141, Korea
[b] Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
[c] Department of Mathematical Sciences, KAIST, Daejeon 34141, Korea

1 Introduction

The convolutional layer plays a crucial role in the success of modern neural networks for image classification (He et al., 2016; Hu et al., 2018; Krizhevsky et al., 2017) and image processing problems (Radford et al., 2015; Ronneberger et al., 2015; Zhang et al., 2017). However, achieving high performance through convolutional neural networks (CNNs) typically requires the use of a large number of channels (Brock et al., 2021; He et al., 2022; Huang et al., 2019; Tan and Le, 2019; Zagoruyko and Komodakis, 2016), resulting in significant computational costs and long training times. Accordingly, there have been many studies focusing on modifying the convolutional layer to reduce its computational complexity (Gholami et al., 2022; Krizhevsky et al., 2017; Lee et al., 2022; Liu et al., 2020; Zhang et al., 2015a, b).

Among them, group convolution is the most basic such modification. It was introduced in AlexNet (Krizhevsky et al., 2017) as a way to distribute the computation of convolutional layers and resolve memory shortages. Group convolution divides the channels of each layer into groups and performs convolution only within each group. CNNs with group convolution succeed in reducing the number of parameters and the computation time, but their performance drops rapidly as the number of groups increases (Lee et al., 2022; Long et al., 2015). It has been speculated that this degradation arises because the lack of intergroup communication greatly reduces the representation capacity of CNNs. However, it has not yet been shown mathematically how much the performance is degraded.

Recently, several modifications of group convolution that add intergroup communication have been considered to restore the performance (Chollet, 2017; Lee et al., 2022; Long et al., 2015; Zhang et al., 2018). The channel shuffling structure used in ShuffleNet (Zhang et al., 2018) adds a permutation step among the output channels of the group convolution to enable data exchange among groups. This method is efficient in terms of memory usage because it does not use any additional parameters. Learnable group convolution (Huang et al., 2018) determines the channels to be grouped through learning. This method generates the weight for the group convolution by overlaying a trainable mask on the weight of the standard convolution; in effect, it changes the channel shuffling from a fixed rule to a learnable one. Although these methods offer no benefit when viewed as a single layer, they effectively restore the performance of CNNs that stack multiple layers. On the other hand, fully learnable group convolution (Wang et al., 2019) introduces more parameters to additionally vary the number of channels in each group. In (Lee et al., 2022), it was observed that group convolution has a block-diagonal matrix structure, and two-level group convolution was introduced, which collects representatives of each group and performs an additional convolution at low computational cost. These works have successfully resolved the performance degradation issue described above, but there is still no mathematical analysis of why the performance is improved.

In this paper, we rigorously analyze the performance of the group convolution as an approximation to the corresponding standard convolution with respect to the number of groups N. To achieve this, we introduce a new measure of approximability (see (3.2)), which quantifies the optimal squared $\ell^{2}$-error between the outputs of the standard convolution and the group convolution. We prove that the approximability of the group convolution is estimated as $K(1-1/N)^{2}$, where K denotes the number of parameters in a convolutional layer that maps a single channel to a single channel (see Theorem 3.5).

In addition, we present a new variant of group convolution, called balanced group convolution (BGC), which achieves a theoretically improved approximability estimate compared to the plain group convolution. In BGC, the intergroup mean is computed by averaging the channels across groups. We then introduce an additional convolution that adds the intergroup mean to the output of the group convolution. This allows each group in BGC to utilize the entire information of the input, resulting in improved approximability at a small additional computational cost. Indeed, we prove that the approximability of BGC is estimated as $K(1-1/N)^{3}$ (see Theorem 3.6), which is a better result than that of the plain group convolution. Furthermore, under an additional assumption, the approximability of the group convolution and BGC can be estimated as $K(1-1/N)/n$ and $K(1-1/N)^{2}/n$, respectively, where n is the number of input channels. The superior performance of the proposed BGC is verified by various numerical experiments on several recent CNNs such as WideResNet (Zagoruyko and Komodakis, 2016), ResNeXt (Xie et al., 2017), MobileNetV2 (Sandler et al., 2018), and EfficientNet (Tan and Le, 2019).

In summary, we address the following issues in this work:

Is it possible to obtain a rigorous estimate for the approximability of group convolution? Furthermore, can we develop an enhanced version of group convolution that achieves a theoretically better approximability than the original?

We summarize our main contributions as follows:

  • We propose BGC, which is an enhanced variant of group convolution with an improved approximability estimate.

  • We estimate the bounds on approximability of group convolution and BGC.

  • We demonstrate the performance of BGC by embedding it into state-of-the-art CNNs.

The rest of this paper is organized as follows. In Section 2, we present preliminaries of this paper, specifically related to the group convolution. We introduce the proposed BGC and its improved approximation properties in Section 3, and establish connections to existing group convolution approaches in Section 4. Numerical validations of the theoretical results and improved classification accuracy of the proposed BGC applied to various state-of-the-art CNNs are presented in Section 5. We conclude this paper with remarks in Section 6.

2 Group convolution

In this section, we introduce some basic notation for group convolution to be used throughout this paper. Let m and n be positive integers and N be a common divisor of m and n. We write $m_{N}=m/N$ and $n_{N}=n/N$.

We consider convolutional layers that operate on n channels and produce m channels. The input $\mathbf{x}$ of a convolutional layer is written as $\mathbf{x}=(x_{1},x_{2},\dots,x_{n})$, where $x_{j}$ is the jth channel of $\mathbf{x}$. Similarly, the output $\mathbf{y}$ is written as $\mathbf{y}=(y_{1},y_{2},\dots,y_{m})$, where $y_{i}$ is the ith channel of $\mathbf{y}$. To represent a standard convolutional layer that takes input $\mathbf{x}$ and produces output $\mathbf{y}$, we use the following matrix expression:

\begin{bmatrix}y_{1}\\ y_{2}\\ \vdots\\ y_{m}\end{bmatrix}=\begin{bmatrix}W_{11}&W_{12}&\cdots&W_{1n}\\ W_{21}&W_{22}&\cdots&W_{2n}\\ \vdots&\vdots&\ddots&\vdots\\ W_{m1}&W_{m2}&\cdots&W_{mn}\end{bmatrix}\begin{bmatrix}x_{1}\\ x_{2}\\ \vdots\\ x_{n}\end{bmatrix}, (2.1)

where $W_{ij}$ is a matrix that represents the generic convolution that maps $x_{j}$ to $y_{i}$. We suppose that the channels of $\mathbf{x}$ and $\mathbf{y}$ are evenly partitioned into N groups. Namely, let $\mathbf{x}=(\mathbf{x}^{1},\mathbf{x}^{2},\dots,\mathbf{x}^{N})$ and $\mathbf{y}=(\mathbf{y}^{1},\dots,\mathbf{y}^{N})$, where

\mathbf{x}^{k}=\begin{bmatrix}x_{(k-1)n_{N}+1}\\ x_{(k-1)n_{N}+2}\\ \vdots\\ x_{kn_{N}}\end{bmatrix},\quad\mathbf{y}^{k}=\begin{bmatrix}y_{(k-1)m_{N}+1}\\ y_{(k-1)m_{N}+2}\\ \vdots\\ y_{km_{N}}\end{bmatrix}.

Then (2.1) is rewritten as the following block form:

\begin{bmatrix}\mathbf{y}^{1}\\ \mathbf{y}^{2}\\ \vdots\\ \mathbf{y}^{N}\end{bmatrix}=\begin{bmatrix}\mathbf{W}^{11}&\mathbf{W}^{12}&\cdots&\mathbf{W}^{1N}\\ \mathbf{W}^{21}&\mathbf{W}^{22}&\cdots&\mathbf{W}^{2N}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{W}^{N1}&\mathbf{W}^{N2}&\cdots&\mathbf{W}^{NN}\end{bmatrix}\begin{bmatrix}\mathbf{x}^{1}\\ \mathbf{x}^{2}\\ \vdots\\ \mathbf{x}^{N}\end{bmatrix}. (2.2)

Group convolution, a computationally efficient version of the convolution proposed in (Krizhevsky et al., 2017), is formed by discarding the intergroup connections in (2.2). That is, we take only the block-diagonal part to obtain the group convolution:

\begin{bmatrix}\mathbf{y}^{1}\\ \mathbf{y}^{2}\\ \vdots\\ \mathbf{y}^{N}\end{bmatrix}=\begin{bmatrix}\mathbf{W}^{11}&\mathbf{O}&\cdots&\mathbf{O}\\ \mathbf{O}&\mathbf{W}^{22}&\cdots&\mathbf{O}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{O}&\mathbf{O}&\cdots&\mathbf{W}^{NN}\end{bmatrix}\begin{bmatrix}\mathbf{x}^{1}\\ \mathbf{x}^{2}\\ \vdots\\ \mathbf{x}^{N}\end{bmatrix}, (2.3)

where $\mathbf{O}$ denotes the zero matrix. We observe that the number of parameters in the group convolution (2.3) is $1/N$ of that of the standard convolution (2.2). Moreover, the representation matrix in (2.3) is block-diagonal, allowing each block on the main diagonal to be processed in parallel, making it computationally more efficient than (2.2). However, when the number of parameters in the group convolution is reduced, the performance of a CNN using group convolutional layers may decrease compared to a CNN using standard convolutional layers (Lee et al., 2022; Zhang et al., 2018).
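In PyTorch-like frameworks, the block-diagonal structure (2.3) corresponds to the groups argument of a convolution module. The following minimal sketch, with illustrative channel counts of our choosing, shows the $1/N$ parameter reduction.

```python
import torch.nn as nn

n, m, N = 64, 128, 4  # illustrative numbers of input channels, output channels, and groups

standard = nn.Conv2d(n, m, kernel_size=3, padding=1, bias=False)           # (2.2)
grouped = nn.Conv2d(n, m, kernel_size=3, padding=1, groups=N, bias=False)  # (2.3)

# The grouped layer holds 1/N of the parameters of the standard layer (K = 9 here).
print(sum(p.numel() for p in standard.parameters()))  # 9 * 128 * 64     = 73728
print(sum(p.numel() for p in grouped.parameters()))   # 9 * 128 * 64 / 4 = 18432
```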

3 Main results

In this section, we present the main results of this paper, including the proposed BGC and estimates of its approximability and computational cost.

3.1 Balanced group convolution

A major drawback of the group convolution, which affects performance, is the lack of intergroup communication, since the representation matrix given in (2.3) is block-diagonal. The motivation behind the proposed BGC is to add intergroup communication to the group convolution at low computational cost in order to improve performance. To achieve this, we utilize a representative vector of small dimension that captures all the information in the input $\mathbf{x}$. Specifically, we use the mean $\bar{\mathbf{x}}$ of $\{\mathbf{x}^{k}\}$, i.e.,

\bar{\mathbf{x}}=\frac{1}{N}\sum_{k=1}^{N}\mathbf{x}^{k}.

Then, the proposed BGC is defined by

\begin{bmatrix}\mathbf{y}^{1}\\ \mathbf{y}^{2}\\ \vdots\\ \mathbf{y}^{N}\end{bmatrix}=\begin{bmatrix}\mathbf{W}^{11}&\mathbf{O}&\cdots&\mathbf{O}\\ \mathbf{O}&\mathbf{W}^{22}&\cdots&\mathbf{O}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{O}&\mathbf{O}&\cdots&\mathbf{W}^{NN}\end{bmatrix}\begin{bmatrix}\mathbf{x}^{1}\\ \mathbf{x}^{2}\\ \vdots\\ \mathbf{x}^{N}\end{bmatrix}+\begin{bmatrix}\bar{\mathbf{W}}^{1}\\ \bar{\mathbf{W}}^{2}\\ \vdots\\ \bar{\mathbf{W}}^{N}\end{bmatrix}\bar{\mathbf{x}}, (3.1)

where each $\bar{\mathbf{W}}^{k}$ is a convolution that operates on $n_{N}$ channels and produces $m_{N}$ channels. That is, BGC is an extension of the group convolution that incorporates an additional structure based on the intergroup mean $\bar{\mathbf{x}}$, which serves as a balancing term that feeds the entire information of $\mathbf{x}$ into the intergroup communication. Specifically, this additional structure allows BGC to better distribute feature representations across groups.

Since the number of parameters in BGC (3.1) is $2/N$ of that of the standard convolution (2.2), the computational cost of BGC is significantly lower than that of the standard convolution. We will discuss this further in Section 3.3.
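A minimal PyTorch sketch of a BGC layer as defined in (3.1) is given below; the module name BalancedGroupConv2d, the use of a 2D convolution, and the omission of bias terms are choices made for illustration.

```python
import torch
import torch.nn as nn

class BalancedGroupConv2d(nn.Module):
    """Sketch of BGC (3.1): a group convolution plus a convolution of the intergroup mean."""

    def __init__(self, in_channels, out_channels, kernel_size, groups, padding=0):
        super().__init__()
        assert in_channels % groups == 0 and out_channels % groups == 0
        self.groups = groups
        # Block-diagonal part: the plain group convolution [W^11, ..., W^NN].
        self.group_conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                                    padding=padding, groups=groups, bias=False)
        # Balancing part: maps the (n/N)-channel intergroup mean to all m output channels,
        # i.e., the stacked blocks [Wbar^1; ...; Wbar^N].
        self.balance_conv = nn.Conv2d(in_channels // groups, out_channels, kernel_size,
                                      padding=padding, bias=False)

    def forward(self, x):
        b, c, h, w = x.shape
        # Intergroup mean: average the N groups of n/N channels each.
        xbar = x.view(b, self.groups, c // self.groups, h, w).mean(dim=1)
        return self.group_conv(x) + self.balance_conv(xbar)

# Usage: a 3x3 BGC layer with 256 input/output channels and N = 4 groups.
layer = BalancedGroupConv2d(256, 256, kernel_size=3, groups=4, padding=1)
y = layer(torch.randn(8, 256, 32, 32))  # y.shape == (8, 256, 32, 32)
```

Both submodules hold $Kmn/N$ parameters, so the total matches the $2Kmn/N$ count reported in Table 1.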

3.2 Approximability estimate

Thanks to the additional structure in the proposed BGC, we may expect that it behaves more similarly to the standard convolution than the group convolution. To provide a rigorous assessment of this behavior, we introduce a notion of approximability and prove that BGC has a better approximability to the standard convolution than the group convolution.

Suppose that the standard convolution $\mathbf{W}=[\mathbf{W}^{kl}]$ that maps N groups to N groups (see (2.2)) and the input $\mathbf{x}=[\mathbf{x}^{k}]$ are drawn from a certain probability distribution. We define the approximability of the group convolution $\mathbf{W}_{\mathrm{gp}}$ of the form (2.3) as

\mathbb{E}_{\mathbf{W}}\left[\inf_{\mathbf{W}_{\mathrm{gp}}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}-\mathbf{W}_{\mathrm{gp}}\mathbf{x}\|^{2}\right]\right], (3.2)

where $\|\cdot\|$ denotes the $\ell^{2}$-norm. Namely, the approximability (3.2) measures the $\mathbf{W}$-expectation of the optimal $\mathbf{x}$-averaged squared $\ell^{2}$-error between the output $\mathbf{W}\mathbf{x}$ of the standard convolution and the output $\mathbf{W}_{\mathrm{gp}}\mathbf{x}$ of the group convolution. The approximability of BGC is defined in the same manner.

To estimate the approximability (3.2), we need the following assumptions on the distributions of $\mathbf{W}$ and $\mathbf{x}$.

Assumption 3.1.

A standard convolution $\mathbf{W}=[W_{ij}]$ and an input $\mathbf{x}=[x_{j}]$ are random variables that satisfy the following:

  (i) $\{W_{ij}\}$ are identically distributed.

  (ii) $\{x_{j}\}$ are independent and identically distributed (i.i.d.).

  (iii) $\mathbf{W}$ and $\mathbf{x}$ are independent.

Assumption 3.1(i) is essential to handle the off-diagonal parts of the group convolution when estimating (3.2). To effectively compensate for the absence of the off-diagonal parts in the group convolution using $\bar{\mathbf{x}}$, we need Assumption 3.1(ii). On the other hand, Assumption 3.1(iii) is quite natural since it asserts that the convolution and the input are independent of each other. Furthermore, if we impose an additional assumption on the parameters of the standard convolution, we can obtain better estimates on the approximability of group convolution and BGC. This assumption is specified in Assumption 3.2.

Assumption 3.2.

A standard convolution $\mathbf{W}=[W_{ij}]$ is a random variable such that $\{W_{ij}\}$ are i.i.d. with $\mathbb{E}[W_{ij}]=0$.

The independence of $W_{ij}$ is essential to obtain a sharper bound on the expectation of $\|\mathbf{W}\mathbf{x}\|^{2}$. The condition $\mathbb{E}[W_{ij}]=0$ typically holds when the parameters of the convolutional layers are generated by Glorot (Glorot and Bengio, 2010) or He (He et al., 2015) initialization, which are i.i.d. samples from a random variable with zero mean.

The following lemma presents a Young-type inequality for the standard convolution, which plays a key role in the proof of the main theorems.

Lemma 3.3.

Let $\mathbf{W}$ be a standard convolution that operates on n channels and produces m channels. For any n-channel input $\mathbf{x}$, we have

\|\mathbf{W}\mathbf{x}\|\leq K^{\frac{1}{2}}\|\mathbf{W}\|_{\mathrm{par}}\|\mathbf{x}\|,

where K is the number of parameters in a convolutional layer that maps a single channel to a single channel and $\|\mathbf{W}\|_{\mathrm{par}}$ denotes the $\ell^{2}$-norm of the vector of parameters of the convolution $\mathbf{W}$.

Proof.

We first prove the case $m=n=1$. Let W be a convolution that maps a single channel to a single channel and $x\in\mathbb{R}^{D}$ be an input. Since the output $Wx$ of the convolution is linear with respect to the parameters of W, there exists a matrix $X\in\mathbb{R}^{D\times K}$ such that

Wx=Xw,

where $w\in\mathbb{R}^{K}$ is the vector of parameters of W. Note that each entry of X is also an entry of x. For each entry $x_{d}$ ($1\leq d\leq D$) of x and each entry $w_{k}$ ($1\leq k\leq K$) of w, their product $w_{k}x_{d}$ appears at most once in the formulation of $Wx$. Hence, the kth column $v_{k}$ of X satisfies

\|v_{k}\|\leq\|x\|. (3.3)

By the triangle inequality, Cauchy–Schwarz inequality, and (3.3), we obtain

\begin{split}\|Wx\|&=\|Xw\|\leq\sum_{k=1}^{K}\|v_{k}w_{k}\|\\ &\leq\left(\sum_{k=1}^{K}\|v_{k}\|^{2}\right)^{\frac{1}{2}}\left(\sum_{k=1}^{K}w_{k}^{2}\right)^{\frac{1}{2}}\leq K^{\frac{1}{2}}\|W\|_{\mathrm{par}}\|x\|,\end{split} (3.4)

which completes the proof of the case $m=n=1$.

Next, we consider the general case. As in (2.1), we write $\mathbf{W}=[W_{ij}]$ and $\mathbf{x}=[x_{j}]$. By the triangle inequality, (3.4), and the Cauchy–Schwarz inequality, we obtain

\begin{aligned}\|\mathbf{W}\mathbf{x}\|^{2}&=\sum_{i=1}^{m}\left\|\sum_{j=1}^{n}W_{ij}x_{j}\right\|^{2}\leq\sum_{i=1}^{m}\left(\sum_{j=1}^{n}\|W_{ij}x_{j}\|\right)^{2}\\ &\leq K\sum_{i=1}^{m}\left(\sum_{j=1}^{n}\|W_{ij}\|_{\mathrm{par}}\|x_{j}\|\right)^{2}\\ &\leq K\sum_{i=1}^{m}\left(\sum_{j=1}^{n}\|W_{ij}\|_{\mathrm{par}}^{2}\cdot\sum_{j=1}^{n}\|x_{j}\|^{2}\right)\\ &=K\|\mathbf{W}\|_{\mathrm{par}}^{2}\|\mathbf{x}\|^{2},\end{aligned}

which is our desired result. ∎

A direct consequence of Lemma 3.3 is that, if $\mathbf{W}$ and $\mathbf{x}$ are independent, then we have

\mathbb{E}_{\mathbf{W}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}\|^{2}\right]\leq K\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}\|^{2}\right]. (3.5)

In the following lemma, we show that the above estimate can be improved by a multiplicative factor of $1/n$ if we additionally assume that $W_{ij}$ has zero mean and that some random variables are independent and/or identically distributed.

Lemma 3.4.

Let $\mathbf{W}=[W_{ij}]$ be a standard convolution that operates on n channels and produces m channels, and let $\mathbf{x}=[x_{j}]$ be an n-channel input. Assume that $\mathbf{W}$ and $\mathbf{x}$ satisfy the following:

  (i) $\{W_{ij}\}$ are i.i.d. with $\mathbb{E}[W_{ij}]=0$.

  (ii) $\{x_{j}\}$ are identically distributed.

  (iii) $\mathbf{W}$ and $\mathbf{x}$ are independent.

Then we have

\mathbb{E}_{\mathbf{W}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}\|^{2}\right]\leq\frac{K}{n}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}\|^{2}\right].
Proof.

We first prove the case $m=1$. For $i\neq j$, invoking (i) yields

\mathbb{E}_{\mathbf{W}}\mathbb{E}_{\mathbf{x}}\left[x_{i}^{\mathrm{T}}W_{1i}^{\mathrm{T}}W_{1j}x_{j}\right]=\mathbb{E}_{\mathbf{x}}\left[x_{i}^{\mathrm{T}}\mathbb{E}\left[W_{1i}^{\mathrm{T}}\right]\mathbb{E}\left[W_{1j}\right]x_{j}\right]=0. (3.6)

On the other hand, for $1\leq j\leq n$, by (iii) and (3.5), we have

\mathbb{E}_{W_{1j}}\mathbb{E}_{x_{j}}\left[\|W_{1j}x_{j}\|^{2}\right]\leq K\mathbb{E}\left[\|W_{1j}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|x_{j}\|^{2}\right]=\frac{K}{n^{2}}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}\|^{2}\right], (3.7)

where the last equality is due to (i) and (ii). By (3.6) and (3.7), we get

\begin{aligned}\mathbb{E}_{\mathbf{W}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}\|^{2}\right]&=\sum_{i=1}^{n}\sum_{j=1}^{n}\mathbb{E}_{\mathbf{W}}\mathbb{E}_{\mathbf{x}}\left[x_{i}^{\mathrm{T}}W_{1i}^{\mathrm{T}}W_{1j}x_{j}\right]\\ &\leq\sum_{j=1}^{n}\frac{K}{n^{2}}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}\|^{2}\right]=\frac{K}{n}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}\|^{2}\right],\end{aligned}

which completes the proof of the case $m=1$. The general case $m>1$ can be shown by applying the case $m=1$ to each ith output channel ($1\leq i\leq m$) and then summing over all i. ∎

The representation matrix of the group convolution presented in (2.3) becomes sparser as the number of groups N increases. This suggests that the approximability of the group convolution deteriorates as N increases. However, to the best of our knowledge, there has been no theoretical analysis of this. Theorem 3.5 presents the first theoretical result regarding the approximability of the group convolution.

Theorem 3.5.

For an n-channel input $\mathbf{x}$, let $\mathbf{W}\mathbf{x}$ and $\mathbf{W}_{\mathrm{gp}}\mathbf{x}$ denote the outputs of the standard and group convolutions given in (2.2) and (2.3), respectively. Under Assumption 3.1, we have

\mathbb{E}_{\mathbf{W}}\left[\inf_{\mathbf{W}_{\mathrm{gp}}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}-\mathbf{W}_{\mathrm{gp}}\mathbf{x}\|^{2}\right]\right]\leq K\left(1-\frac{1}{N}\right)^{2}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}\|^{2}\right].

In addition, if we further assume that Assumption 3.2 holds, then we have

\mathbb{E}_{\mathbf{W}}\left[\inf_{\mathbf{W}_{\mathrm{gp}}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}-\mathbf{W}_{\mathrm{gp}}\mathbf{x}\|^{2}\right]\right]\leq\frac{K}{n}\left(1-\frac{1}{N}\right)\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}\|^{2}\right].
Proof.

Take any standard convolution $\mathbf{W}=[\mathbf{W}^{kl}]$ and any input $\mathbf{x}=[\mathbf{x}^{k}]$ as in (2.2). If we set

\mathbf{W}_{\mathrm{gp}}=\begin{bmatrix}\mathbf{W}^{11}&\mathbf{O}&\cdots&\mathbf{O}\\ \mathbf{O}&\mathbf{W}^{22}&\cdots&\mathbf{O}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{O}&\mathbf{O}&\cdots&\mathbf{W}^{NN}\end{bmatrix},

i.e., if we choose $\mathbf{W}_{\mathrm{gp}}$ as the block-diagonal part of $\mathbf{W}$, then we have

\begin{aligned}\mathbb{E}_{\mathbf{W}}\left[\inf_{\mathbf{W}_{\mathrm{gp}}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}-\mathbf{W}_{\mathrm{gp}}\mathbf{x}\|^{2}\right]\right]&\leq\mathbb{E}_{\mathbf{W}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}-\mathbf{W}_{\mathrm{gp}}\mathbf{x}\|^{2}\right]\\ &=\sum_{k=1}^{N}\mathbb{E}_{\mathbf{W}}\mathbb{E}_{\mathbf{x}}\left[\left\|\sum_{l\neq k}\mathbf{W}^{kl}\mathbf{x}^{l}\right\|^{2}\right].\end{aligned} (3.8)

For each k, since the map $\mathbf{x}\mapsto\sum_{l\neq k}\mathbf{W}^{kl}\mathbf{x}^{l}$ is a convolution that operates on $(n-n/N)$ channels ($\mathbf{x}^{k}$ is excluded from the input), invoking Lemmas 3.3 and 3.4 yields

\mathbb{E}_{\mathbf{W}}\mathbb{E}_{\mathbf{x}}\left[\left\|\sum_{l\neq k}\mathbf{W}^{kl}\mathbf{x}^{l}\right\|^{2}\right]\leq C_{n,N,k}\left(\sum_{l\neq k}\mathbb{E}\left[\|\mathbf{W}^{kl}\|_{\mathrm{par}}^{2}\right]\right)\left(\sum_{l\neq k}\mathbb{E}\left[\|\mathbf{x}^{l}\|^{2}\right]\right), (3.9)

where

C_{n,N,k}=\begin{cases}K,&\text{under Assumption 3.1,}\\ \frac{K}{n-n/N},&\text{under Assumptions 3.1 and 3.2.}\end{cases} (3.10)

Note that the independence condition between $\mathbf{W}$ and $\mathbf{x}$ is used in (3.9).

Meanwhile, Assumption 3.1 implies that

\mathbb{E}\left[\|\mathbf{W}^{kl}\|_{\mathrm{par}}^{2}\right]=\frac{1}{N^{2}}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right],\quad\mathbb{E}\left[\|\mathbf{x}^{l}\|^{2}\right]=\frac{1}{N}\mathbb{E}\left[\|\mathbf{x}\|^{2}\right]. (3.11)

Finally, combining (3.8), (3.9), and (3.11) yields

\mathbb{E}_{\mathbf{W}}\left[\inf_{\mathbf{W}_{\mathrm{gp}}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}-\mathbf{W}_{\mathrm{gp}}\mathbf{x}\|^{2}\right]\right]\leq C_{n,N,k}\left(1-\frac{1}{N}\right)^{2}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}\|^{2}\right],

which completes the proof. ∎

As discussed above, BGC (3.1) has an additional structure that utilizes the intergroup mean $\bar{\mathbf{x}}$ to compensate for the absence of the off-diagonal parts in the group convolution. We obtain the main theoretical result summarized in Theorem 3.6, showing that BGC achieves an improved approximability estimate compared to the group convolution.

Theorem 3.6.

For an n-channel input $\mathbf{x}$, let $\mathbf{W}\mathbf{x}$ and $\mathbf{W}_{\mathrm{bal}}\mathbf{x}$ denote the outputs of the standard convolution and BGC given in (2.2) and (3.1), respectively. Under Assumption 3.1, we have

\mathbb{E}_{\mathbf{W}}\left[\inf_{\mathbf{W}_{\mathrm{bal}}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}-\mathbf{W}_{\mathrm{bal}}\mathbf{x}\|^{2}\right]\right]\leq K\left(1-\frac{1}{N}\right)^{3}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}\|^{2}\right].

In addition, if we further assume that Assumption 3.2 holds, we have

\mathbb{E}_{\mathbf{W}}\left[\inf_{\mathbf{W}_{\mathrm{bal}}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}-\mathbf{W}_{\mathrm{bal}}\mathbf{x}\|^{2}\right]\right]\leq\frac{K}{n}\left(1-\frac{1}{N}\right)^{2}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}\|^{2}\right].
Proof.

Take any standard convolution $\mathbf{W}=[\mathbf{W}^{kl}]$ and any input $\mathbf{x}=[\mathbf{x}^{k}]$ as in (2.2). If we choose $\mathbf{W}_{\mathrm{bal}}$ as

\mathbf{W}_{\mathrm{bal}}\mathbf{x}=\begin{bmatrix}\mathbf{W}^{11}&\mathbf{O}&\cdots&\mathbf{O}\\ \mathbf{O}&\mathbf{W}^{22}&\cdots&\mathbf{O}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{O}&\mathbf{O}&\cdots&\mathbf{W}^{NN}\end{bmatrix}\begin{bmatrix}\mathbf{x}^{1}\\ \mathbf{x}^{2}\\ \vdots\\ \mathbf{x}^{N}\end{bmatrix}+\begin{bmatrix}\sum_{l\neq 1}\mathbf{W}^{1l}\\ \sum_{l\neq 2}\mathbf{W}^{2l}\\ \vdots\\ \sum_{l\neq N}\mathbf{W}^{Nl}\end{bmatrix}\bar{\mathbf{x}},

then we get

\|\mathbf{W}\mathbf{x}-\mathbf{W}_{\mathrm{bal}}\mathbf{x}\|^{2}=\sum_{k=1}^{N}\left\|\sum_{l\neq k}\mathbf{W}^{kl}(\mathbf{x}^{l}-\bar{\mathbf{x}})\right\|^{2}.

Note that, in the proof of Theorem 3.5, the independence condition on $\{x_{j}\}$ was never used. Hence, we can use the same argument as in the proof of Theorem 3.5 to obtain

\begin{split}&\mathbb{E}_{\mathbf{W}}\left[\inf_{\mathbf{W}_{\mathrm{bal}}}\mathbb{E}_{\mathbf{x}}\left[\|\mathbf{W}\mathbf{x}-\mathbf{W}_{\mathrm{bal}}\mathbf{x}\|^{2}\right]\right]\\ &\leq C_{n,N,k}\left(1-\frac{1}{N}\right)^{2}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\mathbb{E}\left[\|\mathbf{x}-\bar{\mathbf{x}}\|^{2}\right]\\ &=C_{n,N,k}\left(1-\frac{1}{N}\right)^{2}\mathbb{E}\left[\|\mathbf{W}\|_{\mathrm{par}}^{2}\right]\left(\mathbb{E}\left[\|\mathbf{x}\|^{2}\right]-N\mathbb{E}\left[\|\bar{\mathbf{x}}\|^{2}\right]\right),\end{split} (3.12)

where $C_{n,N,k}$ was given in (3.10). Meanwhile, since $\{\mathbf{x}^{k}\}$ are i.i.d., we have

\mathbb{E}\left[\|\bar{\mathbf{x}}\|^{2}\right]=\frac{1}{N^{2}}\mathbb{E}\left[\|\mathbf{x}\|^{2}\right]+\frac{N-1}{N}\left\|\mathbb{E}\left[\mathbf{x}^{1}\right]\right\|^{2}\geq\frac{1}{N^{2}}\mathbb{E}\left[\|\mathbf{x}\|^{2}\right]. (3.13)

Combining (3.12) and (3.13) yields the desired result. ∎

Remark 3.7.

Under Assumption 3.1, the approximability of the group convolution has a bound of $K(1-1/N)^{2}$. Adding Assumption 3.2 improves this to a $K(1-1/N)/n$ bound. Since $n\geq N$, the latter is a sharper bound. In both cases, the approximability of BGC has a sharper estimate than that of the group convolution by a factor of $(1-1/N)$.

3.3 Computational cost

We now consider the computational cost of a single convolutional layer with the proposed BGC. Let D be the size of each input channel, i.e., $x_{j}\in\mathbb{R}^{D}$. The number of scalar arithmetic operations required in a single operation of BGC is given by

2Nm_{N}n_{N}KD+nD=\frac{2KDmn}{N}+Dn. (3.14)

In the right-hand side of (3.14), the first term corresponds to the 2N blocks of convolutional layers in (3.1), and the second term corresponds to the computation of $\bar{\mathbf{x}}$. Noting that the standard convolution and the group convolution require $KDmn$ and $KDmn/N$ scalar arithmetic operations, respectively, we conclude that the computational cost of BGC is approximately $2/N$ of that of the standard convolution, but twice that of the group convolution.
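As a concrete illustration of (3.14), the following sketch evaluates these operation counts for a 3×3 layer with m = n = 1024 channels on 7×7 feature maps (so K = 9 and D = 49), the setting used later in Section 5.2.2; the counts are rough estimates that ignore bias terms and boundary effects.

```python
K, m, n, D = 9, 1024, 1024, 7 * 7   # 3x3 kernel, channels, spatial size per channel

def cost_standard():
    return K * D * m * n                    # standard convolution (2.2)

def cost_gc(N):
    return K * D * m * n // N               # group convolution (2.3)

def cost_bgc(N):
    return 2 * K * D * m * n // N + D * n   # BGC, formula (3.14)

for N in (2, 4, 8, 16):
    print(N, cost_standard(), cost_gc(N), cost_bgc(N))
```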

4 Comparison with existing works

Table 1: Overview of various variants of group convolution (GC), where n and m denote the numbers of input and output channels, respectively, N denotes the number of groups, and K represents the number of parameters of the single-channel convolutional layer.
Method | Intergroup commun. | # of channels per group | Grouping | # of parameters | Comput. cost
GC (Krizhevsky et al., 2017) | No | n/N | deterministic | Kmn/N | O(1/N)
Shuffle (Zhang et al., 2018) | Yes | n/N | deterministic | Kmn/N | O(1/N)
Learnable GC (Huang et al., 2018) | Yes | n/N | adaptive | 2Kmn | greater
Fully learnable GC (Wang et al., 2019) | Yes | varies | adaptive | Kmn+Nn+Nm | greater
Two-level GC (Lee et al., 2022) | Yes | n/N | deterministic | Kmn/N+Kn+Nm | greater
BGC | Yes | n/N | deterministic | 2Kmn/N | O(1/N)

In this section, we compare BGC with other variants of group convolution. Various methods have been proposed to improve the performance of group convolution by enabling intergroup communication: Shuffle (Zhang et al., 2018), learnable group convolution (LGC) (Huang et al., 2018), fully learnable group convolution (FLGC) (Wang et al., 2019), and two-level group convolution (TLGC) (Lee et al., 2022). In BGC, input and output channels are evenly partitioned into groups to ensure that all groups have the same computational cost, preventing computational bottlenecks. This feature is shared by GC, Shuffle, LGC, and TLGC, but BGC partitions deterministically, i.e., it uses a fixed partition that is independent of the convolution and the input. Hence, the cost of partitioning the channels into blocks is lower than that of LGC and FLGC, which compute a different partition for each convolution.

The number of parameters in BGC decreases at a rate of $O(1/N)$ as the number of groups N increases, which is consistent with the rates of GC and Shuffle. Consequently, the computational cost of BGC scales like $1/N$, similar to GC and Shuffle. This is in contrast to LGC, FLGC, and TLGC, which have additional parameters that do not scale like $O(1/N)$, making their computational costs greater than $O(1/N)$. Note that the total computational costs of LGC, FLGC, and TLGC are $DKmn/N+(N-1)(Kmn/N+mn/N+1)$, $DKmn/N+(n+m)N^{2}+Kmn/N$, and $DKmn/N+DKn+DNm$, respectively.

As discussed in Theorem 3.6, the upper bound on the approximability of BGC is $K(1-1/N)^{3}$, which is better than that of GC in Theorem 3.5 by a factor of $(1-1/N)$. To the best of our knowledge, this is the first rigorous analysis of the performance of a group convolution variant. On the other hand, there is still no approximation theory for Shuffle, LGC, FLGC, and TLGC.

Table 1 provides an overview of the comparison between BGC and the other variants of group convolution discussed above. As shown in Table 1, Shuffle (Zhang et al., 2018) shares all the advantages of BGC except for the theoretical guarantee on approximability. However, BGC has another advantage over Shuffle. While Shuffle relies on propagation through many layers to achieve intergroup communication and good accuracy, BGC incorporates intergroup communication in each layer, increasing accuracy even when the network is not deep. This is particularly advantageous when BGC is applied to CNNs with wide but not deep layers, such as WideResNet (Zagoruyko and Komodakis, 2016). This assertion is further supported by the numerical results in the next section.

5 Numerical experiments

In this section, we present numerical results that verify our theoretical results and demonstrate the performance of the proposed BGC. All programs were implemented in Python with PyTorch (Paszke et al., 2019) and all computations were performed on a cluster equipped with Intel Xeon Gold 6240R (2.4GHz, 24C) CPUs, NVIDIA RTX 3090 GPUs, and the operating system CentOS 7.

5.1 Verification of the approximability estimates

[Figure 1: four panels, (a) $S=10^{2}$, $\mathcal{N}(0,1)$; (b) $S=10^{3}$, $\mathcal{N}(0,1)$; (c) $S=10^{2}$, $\mathcal{U}(-1,1)$; (d) $S=10^{3}$, $\mathcal{U}(-1,1)$.]
Figure 1: Graphs depicting $\log\mathcal{E}$ with respect to $\log(1-1/N)$ for GC and the proposed BGC, where $\mathcal{E}$ is the error measure given in (5.1) and N is the number of groups. We draw data points from either the normal distribution $\mathcal{N}(0,1)$ or the uniform distribution $\mathcal{U}(-1,1)$. The number in the legend indicates the average slope. Note that $N=2^{k}$, $k=2,\dots,6$.
[Figure 2: four panels, (a) $S=10^{2}$, $\mathcal{N}(0,1)$; (b) $S=10^{3}$, $\mathcal{N}(0,1)$; (c) $S=10^{2}$, $\mathcal{U}(-1,1)$; (d) $S=10^{3}$, $\mathcal{U}(-1,1)$.]
Figure 2: Graphs depicting $\mathrm{Rel.}\mathcal{E}/(1-1/N)^{p}$ with respect to the number of groups N for GC ($p=2$) and the proposed BGC ($p=3$), where $\mathcal{E}$ is the error measure given in (5.1). We draw data points from either the normal distribution $\mathcal{N}(0,1)$ or the uniform distribution $\mathcal{U}(-1,1)$.

We verify the approximability estimates of GC and BGC, and in particular that the estimate for BGC is sharper than that for GC, as shown in Theorems 3.5 and 3.6. We consider a set of one-dimensional standard convolutional layers $\{\mathbf{W}^{(t)}\}_{t=1}^{S}$ with $m=n=256$ and $K=3$, generated by He initialization (He et al., 2015). We also sample S random data points $\{\mathbf{x}^{(s)}\}_{s=1}^{S}$ from either the normal distribution $\mathcal{N}(0,1)$ or the uniform distribution $\mathcal{U}(-1,1)$. To measure the approximability of GC ($\mathrm{m}=\mathrm{gp}$) and BGC ($\mathrm{m}=\mathrm{bal}$) with respect to $\mathbf{W}$, we use the following measure:

\mathcal{E}=\frac{1}{S}\sum_{t=1}^{S}\left(\min_{\mathbf{W}_{\mathrm{m}}}\frac{1}{S}\sum_{s=1}^{S}\|\mathbf{W}^{(t)}\mathbf{x}^{(s)}-\mathbf{W}_{\mathrm{m}}\mathbf{x}^{(s)}\|^{2}\right), (5.1)

which can be evaluated by conventional least squares solvers (Golub and Van Loan, 2013).
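For reference, the sketch below reproduces a scaled-down version of this experiment; instead of an exact least-squares solve, it approximates the inner minimization in (5.1) by gradient descent over a single draw of $\mathbf{W}$, and the class and function names are ours.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n = m = 64                   # smaller than the 256 channels in the paper, for a quick run
K, N, S, D = 3, 4, 200, 32   # kernel size, groups, number of samples, channel length

class BGC1d(nn.Module):
    """1D analogue of the BGC sketch in Section 3.1."""
    def __init__(self, n, m, K, N):
        super().__init__()
        self.N = N
        self.group = nn.Conv1d(n, m, K, padding=K // 2, groups=N, bias=False)
        self.balance = nn.Conv1d(n // N, m, K, padding=K // 2, bias=False)
    def forward(self, x):
        b, c, d = x.shape
        xbar = x.view(b, self.N, c // self.N, d).mean(dim=1)
        return self.group(x) + self.balance(xbar)

# Random standard convolution W (He initialization) and i.i.d. standard normal inputs x.
W = nn.Conv1d(n, m, K, padding=K // 2, bias=False)
nn.init.kaiming_normal_(W.weight)
x = torch.randn(S, n, D)
with torch.no_grad():
    y = W(x)

def approx_error(candidate, steps=2000, lr=1e-2):
    """Approximate the inner minimum in (5.1) for one draw of W by gradient descent."""
    opt = torch.optim.Adam(candidate.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((candidate(x) - y) ** 2).sum(dim=(1, 2)).mean()
        loss.backward()
        opt.step()
    return loss.item()

err_gc = approx_error(nn.Conv1d(n, m, K, padding=K // 2, groups=N, bias=False))
err_bgc = approx_error(BGC1d(n, m, K, N))
print(err_gc, err_bgc)       # BGC should attain a smaller optimal error than GC
```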

The graphs in Figure 1 depict $\log\mathcal{E}$ with respect to $\log(1-1/N)$ in various settings for GC and BGC. One can readily observe linear relations between $\log\mathcal{E}$ and $\log(1-1/N)$ for both GC and BGC. That is, we have an empirical formula

\mathcal{E}\approx C\left(1-\frac{1}{N}\right)^{\gamma}\frac{1}{S}\sum_{t=1}^{S}\|\mathbf{W}^{(t)}\|^{2}\frac{1}{S}\sum_{s=1}^{S}\|\mathbf{x}^{(s)}\|^{2} (5.2)

for some positive constants C and $\gamma$ independent of N. From the graphs, we see that $\gamma\approx 1$ for GC and $\gamma\approx 2$ for BGC, confirming our theoretical results in Theorems 3.5 and 3.6. Next, we define the relative approximability

\mathrm{Rel.}\mathcal{E}=\frac{\mathcal{E}}{\frac{1}{S}\sum_{t=1}^{S}\|\mathbf{W}^{(t)}\|^{2}\frac{1}{S}\sum_{s=1}^{S}\|\mathbf{x}^{(s)}\|^{2}}. (5.3)

In Figure 2, we plot $\mathrm{Rel.}\mathcal{E}/(1-1/N)^{p}$ with respect to N under various settings for GC and BGC, where $p=1$ for GC and $p=2$ for BGC. We observe that all values of $\mathrm{Rel.}\mathcal{E}/(1-1/N)^{p}$ are bounded by $3.83\times 10^{-3}$ independently of N, which implies that the constant C is less than $K/n\approx 11.72\times 10^{-3}$ in (5.2). This observation also agrees with the theoretical results in Theorems 3.5 and 3.6.

5.2 Embedding in recent CNNs

In Section 5.1, we verified that our approximability estimates are consistent with experimental results using synthetic data. Now, we examine the classification performance of BGC applied to various recent CNNs, including ResNet (He et al., 2016), WideResNet (Zagoruyko and Komodakis, 2016), ResNeXt (Xie et al., 2017), MobileNetV2 (Sandler et al., 2018), and EfficientNet (Tan and Le, 2019). The network structures used in our experiments are described below:

  • ResNet. In our experiments, we used ResNet-50, which was designed for ImageNet classification. It uses a bottleneck structure consisting of a 1×1 standard convolution, a group convolution with a 3×3 kernel, another 1×1 standard convolution, and a skip connection.

  • WideResNet. In our experiments, we used two different structures of WideResNet: WideResNet-28-10 and WideResNet-34-2, which were used for CIFAR-10 and ImageNet classification, respectively. It uses a residual unit consisting of two 3×3 convolutions and a skip connection.

  • ResNeXt. It uses the same structure as ResNet except that it uses a 3×3 group convolution instead of the 3×3 convolution. In our experiments, we used ResNeXt-29 (8×64d) for CIFAR-10 classification. We applied the group convolutions to the two 1×1 standard convolutions.

  • MobileNetV2. The basic structure of MobileNetV2 (Sandler et al., 2018) is an inverted residual block, similar to the bottleneck in ResNeXt. However, MobileNetV2 uses a depthwise convolution (Chollet, 2017) instead of ResNeXt's 3×3 group convolution. In our experiments, we applied group convolution to the two 1×1 standard convolutions.

  • EfficientNet. It is based on MnasNet (Tan et al., 2019), a modification of MobileNetV2. Using an automated machine learning technique (Zoph and Le, 2017), EfficientNet proposes several model structures with appropriate numbers of channels and depth. It also has several variations, from b0 to b7 models, depending on the number of parameters. In our experiments, we used the most basic model b0.

Table 2: Details of CIFAR-10 and ImageNet datasets.
Dataset Image size Classes # of training / validation samples
CIFAR-10 32×32 10 50,000 / 10,000
ImageNet 224×224 1,000 1,280,000 / 50,000

We evaluate the performance of various variants of group convolution on the CIFAR-10 (Krizhevsky, 2009) and ImageNet ILSVRC 2012 (Deng et al., 2009) datasets. Details on these datasets are given in Table 2. In addition, for the CIFAR-10 dataset, the data augmentation technique of (Lee et al., 2015) was adopted for training: four pixels are padded on each side of the images, and 32×32 random crops are sampled from the padded images and their horizontal flips. For the ImageNet dataset, input images of size 224×224 are randomly cropped from a resized image using the scale and aspect ratio augmentation of (Szegedy et al., 2015), as implemented in (Gross and Wilber, 2016).
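A minimal torchvision sketch of the CIFAR-10 augmentation described above might look as follows; any normalization step is omitted since it is not specified here.

```python
from torchvision import datasets, transforms

# Pad each side by 4 pixels, take a random 32x32 crop, and apply a random horizontal flip.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=train_transform)
```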

5.2.1 Ablation study through transfer learning

First, we apply GC and BGC to a network pre-trained on real data to see how well BGC works compared to GC. We select a ResNet-50 pre-trained on the ImageNet dataset. Note that the pre-trained parameters are available in the PyTorch library (Paszke et al., 2019). By transferring the parameters of the pre-trained standard convolutions in ResNet-50, we obtained the parameters of (2.3) and (3.1); the resulting networks are referred to as ResNet-50-GC and ResNet-50-BGC, respectively. We then further trained ResNet-50-GC and ResNet-50-BGC for 30 epochs using the SGD optimizer with batch size 128, learning rate 0.01, Nesterov momentum 0.9, and weight decay 0.0001.
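One way to perform such a transfer is sketched below, following the choices of $\mathbf{W}_{\mathrm{gp}}$ and $\mathbf{W}_{\mathrm{bal}}$ used in the proofs of Theorems 3.5 and 3.6 (keep the block-diagonal blocks and sum the off-diagonal blocks for the balancing term); whether this exact initialization was used for Table 3 is an assumption of the sketch, and the torchvision usage in the comments is likewise assumed.

```python
import torch

def transfer_to_gc_and_bgc(weight, N):
    """Split a pre-trained standard convolution weight of shape (m, n, kh, kw) into
    GC and BGC parameters, following the constructions in the proofs of Theorems 3.5 and 3.6.
    The choice Wbar^k = sum_{l != k} W^{kl} for the balancing term is an assumption here."""
    m, n, kh, kw = weight.shape
    mN, nN = m // N, n // N
    blocks = weight.view(N, mN, N, nN, kh, kw)                  # blocks[k, :, l, :] = W^{kl}
    # GC weight: block-diagonal part [W^11, ..., W^NN], shape (m, n/N, kh, kw).
    w_gc = torch.cat([blocks[k, :, k] for k in range(N)], dim=0)
    # BGC balancing weight: stack of sums over off-diagonal blocks, shape (m, n/N, kh, kw).
    w_bal = torch.cat([blocks[k].sum(dim=1) - blocks[k, :, k] for k in range(N)], dim=0)
    return w_gc, w_bal

# Hypothetical usage with a pre-trained ResNet-50 from torchvision:
# from torchvision.models import resnet50
# conv = resnet50(weights="IMAGENET1K_V1").layer1[0].conv2     # a 3x3 convolution, 64 -> 64 channels
# w_gc, w_bal = transfer_to_gc_and_bgc(conv.weight.data, N=4)
```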

The classification performance of ResNet-50-GC and ResNet-50-BGC is given in Table 3. Compared to GC, the classification error of BGC improves by between roughly 4 and 15 percentage points. Therefore, even for real data, BGC clearly recovers much of the performance lost by GC.

Table 3: Classification errors (%) of ResNet-50-GC and ResNet-50-BGC on the ImageNet dataset after transfer learning. Each network is transfer-learned from the pre-trained ResNet-50 provided in PyTorch. Note that the case of N=1 is the classification error of ResNet-50.
N ResNet-50-GC ResNet-50-BGC
1 23.87
2 26.54 22.45
4 36.70 25.51
8 43.72 31.13
16 50.09 35.45

5.2.2 Computational efficiency

To verify the computational efficiency of BGC, we conducted experiments varying the number of groups N of a 3×3 convolution mapping 1024 channels to 1024 channels with input $\mathbf{x}\in\mathbb{R}^{128\times 1024\times 7\times 7}$. Note that this convolution is used in ResNet-50. Table 4 shows the total memory usage, computation time, and classification errors for forward and backward propagation of convolutions equipped with GC and BGC, with the number of groups N varying from 2 to 16. The results are computed on a single GPU. First, looking at the total memory usage, BGC uses more memory than GC, but the gap narrows as N increases. The additional memory usage of BGC occurs when computing the mean $\bar{\mathbf{x}}$ and the convolution $\bar{\mathbf{W}}$ defined in (3.1). On the other hand, looking at the computation time, BGC is slower than GC but faster than the standard convolution when $N>2$. As can be seen from (3.14), BGC requires more computation time than GC because it performs convolution twice, but the time decreases as N increases. Moreover, we can see that the cost of calculating the mean $\bar{\mathbf{x}}$ is small enough that it has little effect on the total computation time. These results suggest that, while BGC increases memory consumption and computation time compared to GC, it can improve performance with only a small increase in computational cost when dealing with group convolutions over many channels.

Table 4: Total memory usage, computation time, and classification errors (%) for forward and backward propagation of a 3×3 convolution from 1024 channels to 1024 channels equipped with GC and BGC and the input $\mathbf{x}\in\mathbb{R}^{128\times 1024\times 7\times 7}$. Note that the case of N=1 is computed with the standard convolution.
N Total memory usage (MB) Computation time (ms) Classification error (%)
GC BGC GC BGC GC BGC
1 60.50 7.979 23.87
2 42.50 75.75 9.056 10.467 26.54 22.45
4 33.50 48.62 4.449 4.790 36.70 25.51
8 29.00 37.06 2.120 2.648 43.72 31.13
16 26.75 31.53 2.265 2.557 50.09 35.45
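A rough way to reproduce per-layer measurements like those in Table 4 is sketched below, assuming a CUDA device is available; the exact methodology used for the table may differ, and the BGC layer from the sketch in Section 3.1 can be measured in the same way.

```python
import time
import torch
import torch.nn as nn

def measure(layer, x, iters=50):
    # Rough forward+backward timing and peak memory for one layer on the GPU.
    layer, x = layer.cuda(), x.cuda().requires_grad_(True)
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        layer(x).sum().backward()
    torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) / iters * 1e3
    peak_mb = torch.cuda.max_memory_allocated() / 2**20
    return elapsed_ms, peak_mb

x = torch.randn(128, 1024, 7, 7)
for N in (1, 2, 4, 8, 16):
    gc_layer = nn.Conv2d(1024, 1024, 3, padding=1, groups=N, bias=False)
    print(N, measure(gc_layer, x))
```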

5.2.3 Comparison with existing approaches

Table 5: Classification errors (%) of EfficientNet-b0, WideResNet-28-10, and ResNeXt-29 (8×64d) on the CIFAR-10 dataset equipped with GC, Shuffle, fully learnable group convolution (FLGC), two-level group convolution (TLGC), and BGC. The case of N=1 corresponds to the standard convolution (SC). Note that EfficientNet-b0 could not increase N up to 16 due to its structure, so it was implemented only up to 8.
N Method EfficientNet-b0 WideResNet-28-10 ResNeXt-29 (8×64d)
1 SC 8.42 3.88 5.32
2 GC 10.02 4.33 5.92
Shuffle 8.72 4.00 4.52
FLGC 8.63 4.62 4.68
TLGC 8.93 4.12 5.16
BGC 7.16 3.75 5.84
4 GC 11.25 5.44 6.76
Shuffle 9.41 4.21 4.17
FLGC 9.70 6.16 5.16
TLGC 9.42 4.34 6.92
BGC 7.50 4.02 5.84
8 GC 13.14 6.43 6.52
Shuffle 10.65 4.73 4.81
FLGC 10.49 9.81 5.25
TLGC 10.68 4.79 6.56
BGC 7.80 4.22 6.00
16 GC - 8.47 6.92
Shuffle - 5.28 4.89
FLGC - 11.26 6.11
TLGC - 4.95 6.08
BGC - 4.76 5.60
Table 6: Classification errors (%) of EfficientNet-b0, WideResNet-34-2, and MobileNetV2 on the ImageNet dataset equipped with GC, Shuffle, fully learnable group convolution (FLGC), two-level group convolution (TLGC), and BGC. The case of N=1 corresponds to the standard convolution (SC).
N Method EfficientNet-b0 WideResNet-34-2 MobileNetV2
1 SC 30.68 24.99 34.13
2 GC 35.18 26.92 39.50
Shuffle 33.91 25.12 36.67
FLGC 35.53 30.53 37.98
TLGC 33.50 25.36 33.56
BGC 31.82 24.19 36.10
4 GC 41.64 31.96 46.15
Shuffle 38.02 26.84 38.14
FLGC 40.71 38.30 44.37
TLGC 36.38 27.17 36.88
BGC 33.78 25.36 37.82
8 GC 48.42 36.70 51.97
Shuffle 42.74 29.03 42.75
FLGC 45.38 45.54 49.28
TLGC 38.66 28.43 39.16
BGC 38.13 27.16 42.65

As benchmark group convolution variants, we choose GC (Krizhevsky et al., 2017), Shuffle (Zhang et al., 2018), FLGC (Wang et al., 2019), and TLGC (Lee et al., 2022), which were discussed in Section 4. All neural networks for the CIFAR-10 dataset in this section were trained using stochastic gradient descent with batch size 128, weight decay 0.0005, Nesterov momentum 0.9, 200 total epochs, and weights initialized as in (He et al., 2015). The initial learning rate was set to 0.1 and was reduced to one tenth of its value at the 60th, 120th, and 160th epochs. For ImageNet, the hyperparameter settings are the same as in the CIFAR-10 case, except for weight decay 0.0001, 90 total epochs, and the learning rate reduced by a factor of 10 at the 30th and 60th epochs.
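A minimal PyTorch sketch of the CIFAR-10 optimization setup described above follows; the placeholder model and the omitted training loop are ours.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # placeholder for any of the CNNs in Table 5

# SGD with Nesterov momentum and weight decay 5e-4 (batch size 128 is set in the data loader).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            weight_decay=5e-4, nesterov=True)
# Learning rate reduced to one tenth at the 60th, 120th, and 160th epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 120, 160], gamma=0.1)

for epoch in range(200):
    # ... one cross-entropy training pass over CIFAR-10 (omitted) ...
    scheduler.step()
```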

The classification errors of BGC and the other benchmarks applied to various CNNs on the CIFAR-10 dataset are presented in Table 5. BGC shows better overall results than the other methods. In particular, applying BGC significantly improves the classification performance of EfficientNet-b0. In addition, the classification errors of BGC for WideResNet-28-10 remain below 5% as N increases up to 16. Although Shuffle performs best for ResNeXt-29, BGC still outperforms GC. To further validate the performance of BGC, we report the classification results on the ImageNet dataset in Table 6. For this dataset, we observe that BGC outperforms the other benchmark group convolution variants for all CNN architectures except MobileNetV2. Therefore, through several experiments, we conclude that BGC is an effective and theoretically guaranteed alternative to group convolution for various CNN architectures on large-scale datasets.

6 Conclusion

In this paper, we proposed a novel variant of group convolution called BGC. We constructed BGC by combining the plain group convolution structure with a balancing term defined via the intergroup mean to improve intergroup communication. We designed a new measure (3.2) to assess the approximability of group convolution variants and proved that the approximability of group convolution is bounded by $K(1-1/N)^{2}$. We also showed that the bound on the approximability of the proposed BGC is $K(1-1/N)^{3}$, which is an improvement over the group convolution. Moreover, under an additional assumption on the parameters of the standard convolution, we showed that the bounds for the approximability of group convolution and BGC are $K(1-1/N)/n$ and $K(1-1/N)^{2}/n$, respectively. Numerical experiments with various CNNs such as WideResNet, MobileNetV2, ResNeXt, and EfficientNet have demonstrated the practical efficacy of BGC on large-scale neural networks and datasets.

We conclude this paper with a remark on BGC. A major drawback of the proposed BGC is that it requires full data communication among groups: when computing the intergroup mean $\bar{\mathbf{x}}$ that appears in the balancing term, we need the entire input $\mathbf{x}$. This high volume of communication can be a bottleneck in parallel computation, which limits the performance of the model on distributed-memory systems. We note that TLGC (Lee et al., 2022) has addressed this issue by minimizing the amount of communication required. Exploring how to improve BGC by reducing communication in a similar manner to (Lee et al., 2022), while maintaining strong performance in both theory and practice, is left as future work.

Acknowledgments

This work was supported in part by the National Research Foundation (NRF) of Korea grant funded by the Korea government (MSIT) (No. RS-2023-00208914), and in part by Basic Science Research Program through NRF funded by the Ministry of Education (No. 2019R1A6A1A10073887).

References

  • Brock et al. (2021) Brock, A., De, S., Smith, S.L., Simonyan, K., 2021. High-performance large-scale image recognition without normalization, in: Proceedings of the 38th International Conference on Machine Learning, PMLR. pp. 1059–1071.
  • Chollet (2017) Chollet, F., 2017. Xception: Deep learning with depthwise separable convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
  • Gholami et al. (2022) Gholami, A., Kim, S., Zhen, D., Yao, Z., Mahoney, M., Keutzer, K., 2022. Low-Power Computer Vision: Improve the Efficiency of Artificial Intelligence (1st ed.). Chapman and Hall/CRC, Boca Raton, FL. chapter 13. pp. 291–326.
  • Glorot and Bengio (2010) Glorot, X., Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR. pp. 249–256.
  • Golub and Van Loan (2013) Golub, G.H., Van Loan, C.F., 2013. Matrix Computations. Johns Hopkins University Press, Baltimore.
  • Gross and Wilber (2016) Gross, S., Wilber, M., 2016. Training and investigating residual nets. Facebook AI Research 6.
  • He et al. (2022) He, J., Li, L., Xu, J., 2022. Approximation properties of deep ReLU CNNs. Research in the Mathematical Sciences 9, 38.
  • He et al. (2015) He, K., Zhang, X., Ren, S., Sun, J., 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • He et al. (2016) He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Hu et al. (2018) Hu, J., Shen, L., Sun, G., 2018. Squeeze-and-Excitation networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Huang et al. (2018) Huang, G., Liu, S., van der Maaten, L., Weinberger, K.Q., 2018. CondenseNet: An efficient DenseNet using learned group convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Huang et al. (2019) Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., Lee, H., Ngiam, J., Le, Q.V., Wu, Y., Chen, z., 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism, in: Advances in Neural Information Processing Systems, Curran Associates, Inc.
  • Krizhevsky (2009) Krizhevsky, A., 2009. Learning multiple layers of features from tiny images. Technical Report. University of Toronto.
  • Krizhevsky et al. (2017) Krizhevsky, A., Sutskever, I., Hinton, G.E., 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM 60, 84–90.
  • Lee et al. (2015) Lee, C.Y., Xie, S., Gallagher, P., Zhang, Z., Tu, Z., 2015. Deeply-Supervised Nets, in: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, PMLR. pp. 562–570.
  • Lee et al. (2022) Lee, Y., Park, J., Lee, C.O., 2022. Two-level group convolution. Neural Networks 154, 323–332.
  • Liu et al. (2020) Liu, J., Tripathi, S., Kurup, U., Shah, M., 2020. Pruning algorithms to accelerate convolutional neural networks for edge applications: A survey. arXiv preprint arXiv:2005.04275 .
  • Long et al. (2015) Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S., 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32, 8024–8035.
  • Radford et al. (2015) Radford, A., Metz, L., Chintala, S., 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 .
  • Ronneberger et al. (2015) Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: Convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Springer, Cham. pp. 234–241.
  • Sandler et al. (2018) Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C., 2018. MobileNetV2: Inverted residuals and linear bottlenecks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Szegedy et al. (2015) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Tan et al. (2019) Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., Le, Q.V., 2019. MnasNet: Platform-aware neural architecture search for mobile, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Tan and Le (2019) Tan, M., Le, Q., 2019. EfficientNet: Rethinking model scaling for convolutional neural networks, in: Proceedings of the 36th International Conference on Machine Learning, PMLR. pp. 6105–6114.
  • Wang et al. (2019) Wang, X., Kan, M., Shan, S., Chen, X., 2019. Fully learnable group convolution for acceleration of deep neural networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Xie et al. (2017) Xie, S., Girshick, R., Dollar, P., Tu, Z., He, K., 2017. Aggregated residual transformations for deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Zagoruyko and Komodakis (2016) Zagoruyko, S., Komodakis, N., 2016. Wide residual networks, in: Proceedings of the British Machine Vision Conference (BMVC), BMVA Press. pp. 87.1–87.12.
  • Zhang et al. (2017) Zhang, K., Zuo, W., Chen, Y., Meng, D., Zhang, L., 2017. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing 26, 3142–3155.
  • Zhang et al. (2018) Zhang, X., Zhou, X., Lin, M., Sun, J., 2018. Shufflenet: An extremely efficient convolutional neural network for mobile devices, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Zhang et al. (2015a) Zhang, X., Zou, J., He, K., Sun, J., 2015a. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 1943–1955.
  • Zhang et al. (2015b) Zhang, X., Zou, J., Ming, X., He, K., Sun, J., 2015b. Efficient and accurate approximations of nonlinear convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Zoph and Le (2017) Zoph, B., Le, Q., 2017. Neural architecture search with reinforcement learning, in: International Conference on Learning Representations.