
Sparse Optimization for Green Edge AI Inference

Xiangyu Yang,  Sheng Hua,  Yuanming Shi*,   Hao Wang,   Jun Zhang,  and Khaled B. Letaief
X. Yang and S. Hua are with the School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China, also with the Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China, and also with the University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: {yangxy3, huasheng}@shanghaitech.edu.cn). Y. Shi and H. Wang are with the School of Information Science and Technology, ShanghaiTech University, Shanghai, China (e-mail: {shiym, wanghao1}@shanghaitech.edu.cn). J. Zhang is with the Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong (e-mail: [email protected]). K. B. Letaief is with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong (e-mail: [email protected]). He is also with Peng Cheng Laboratory, Shenzhen, China. *(Corresponding author: Yuanming Shi.) Part of this work was presented at the IEEE 90th Vehicular Technology Conference (VTC2019-Fall), Honolulu, Hawaii, USA, Sept. 2019 [1].
Abstract

With the rapid upsurge of deep learning tasks at the network edge, effective edge artificial intelligence (AI) inference becomes critical to provide low-latency intelligent services for mobile users by leveraging the edge computing capability. In such scenarios, energy efficiency becomes a primary concern. In this paper, we present a joint inference task selection and downlink beamforming strategy to achieve energy-efficient edge AI inference by minimizing the overall power consumption, consisting of both computation and transmission power consumption, which yields a mixed combinatorial optimization problem. By exploiting the inherent connection between the set of selected tasks and the group sparsity structure of the transmit beamforming vector, we reformulate the optimization as a group sparse beamforming problem. To solve this challenging problem, we propose a log-sum function based three-stage approach. By adopting the log-sum function to enhance group sparsity, a proximal iteratively reweighted algorithm is developed. Furthermore, we establish the global convergence of this algorithm and provide its ergodic worst-case convergence rate. Simulation results demonstrate the effectiveness of the proposed approach for improving energy efficiency in edge AI inference systems.

Index Terms:
AI, edge inference, cooperative transmission, energy efficiency, group sparse beamforming, proximal iteratively reweighted algorithm.

I Introduction

THE availability of big data and computing power, along with advances in optimization algorithms, has triggered a booming era of artificial intelligence (AI). Notably, deep learning [2] is regarded as the most popular branch of modern AI and has achieved exciting breakthroughs in applications such as speech recognition and computer vision [3]. Benefiting from these achievements, AI is becoming a promising tool that streamlines people’s decision-making processes and facilitates the development of diversified intelligent services (e.g., virtual personal assistants, recommendation systems, etc.). Meanwhile, with the proliferation of mobile computing as well as Internet-of-Things (IoT) devices, massive real-time data are generated locally [4]. However, it is widely acknowledged that traditional cloud-based computing [5, 6] faces challenges (e.g., latency, privacy and network congestion) in supporting ubiquitous AI-empowered applications on mobile devices [7].

In contrast, edge AI is a promising approach that can tackle the above concerns by fusing mobile edge computing [8] with AI-enabled techniques (e.g., deep neural networks (DNNs)). By pushing AI models to the network edge, it brings the edge servers close to the requesting mobile devices and thus enables low-latency and privacy-preserving services. Notably, edge AI is envisioned as a key ingredient of future intelligent 6G networks [9, 10, 11, 12], which fully unleashes the potential of mobile communication and computation. Typically, edge AI consists of two phases: edge training and edge inference. In the edge training phase, AI models can be trained on the cloud, at the edge, or on end devices [13]; however, training is beyond the scope of this work. This paper mainly focuses on edge inference, i.e., deploying trained AI models and performing model inference at the network edge. Following [7, 14], the edge AI inference architecture is generally classified into three major types:

  • On-device inference: It performs the model inference directly on end devices where DNN models are deployed. While some enabling techniques (e.g., model compression [15, 16], hardware speedup [17]) have been proposed to facilitate the deployment of the DNN model, it still poses challenges for resource-limited (e.g., memory, power budget and computation) end devices [18]. To mitigate such concerns, on-device distributed computing is envisioned as a promising solution for on-device inference, which enables AI model inference across multiple distributed end devices [19].

  • Joint device-edge inference: This mode carries out the AI model inference in a device-edge cooperation fashion [7] with the model partition and model early-exit techniques [20]. While device-edge cooperation is flexible and enables low-latency edge inference, it may still place high resource requirements on end devices due to the resource-demanding nature of DNNs [21].

  • Edge server inference: Such methods transfer the raw input data to edge servers for processing, which then return the inference results to end users [22, 23]. Edge server inference is particularly suitable for computation-intensive tasks. Nonetheless, the inference performance relies mainly on the channel bandwidth between the edge server and the end devices. Cooperative transmission [24] thus becomes promising for communication-efficient delivery of inference results.

To support computation-intensive tasks for resource-limited end devices, edge server inference stands out as a viable solution to fulfill the key performance requirements. The main focus of this paper is on AI model inference for mobile devices with the edge server inference architecture. For the edge AI inference system, energy efficiency is a key performance indicator [14], which motivates us to focus on energy-efficient edge inference design. This is achieved by optimizing the overall network power consumption, including the computation power consumption for performing inference tasks and the transmission power consumption for returning inference results. In particular, cooperative transmission [24] is a widely recognized technique to reduce the downlink transmit power consumption and provide low-latency transmission services by exploiting high beamforming gains for edge AI inference. In this work, we thus consider that multiple edge base stations (BSs) collaboratively transmit the inference results to the end devices [22]. To enable transmission cooperation, we apply the computation replication principle [25], i.e., the inference tasks from end devices can be performed by several neighboring edge BSs to create multiple copies of the inference results. However, computation replication greatly increases the power consumption of performing inference tasks. Therefore, it is necessary to select the inference tasks to be performed by each edge BS to achieve an optimal balance between communication and computation power consumption.

In this paper, we propose a joint inference task selection and downlink beamforming strategy towards achieving energy-efficient edge AI inference by optimizing the overall network power consumption, consisting of the computation power consumption and the transmission power consumption, under quality-of-service (QoS) constraints. However, the resulting formulation contains combinatorial variables and nonconvex constraints, which makes it computationally intractable. To address this issue, we observe that the transmit beamforming vector has an intrinsic connection with the set of selected inference tasks (i.e., the tasks each edge server opts to execute). Based on this crucial observation, we present a group sparse beamforming (GSBF) reformulation, followed by a log-sum function based three-stage GSBF approach. In particular, in the first stage, we adopt a weighted log-sum function based relaxation to enhance the group sparsity of the structural solutions.

Nonetheless, the log-sum function minimization problem poses challenges in computation and analysis. To resolve these issues, we present a proximal iteratively reweighted algorithm, which solves a sequence of weighted convex subproblems. Moreover, we establish the global convergence and worst-case convergence rate of the presented proximal iteratively reweighted algorithm. Specifically, by leveraging the Fréchet subdifferential [26], we characterize the first-order necessary optimality conditions of the formulated convex-constrained log-sum problem. We then show that the iterates generated by the proposed algorithm steadily decrease the objective value and prove that, for any initial feasible point, any cluster point of the generated sequence is a critical point of the original objective. Finally, we show that the defined optimality residual has an O(1/t) ergodic worst-case convergence rate, where t is the iteration counter.

The major contributions of this paper are summarized as follows.

  • 1.

    We propose a joint task selection and downlink beamforming strategy to optimize the trade-off between computation and communication power consumption for an energy-efficient edge AI inference system. In particular, task selection is achieved by controlling the group sparsity structure of the transmit beamforming vector, thereby formulating a group sparse beamforming problem under the target QoS constraints.

  • 2.

    To solve the resulting optimization problem, we propose a log-sum function based three-stage GSBF approach. In particular, we adopt a weighted log-sum approximation to enhance the group sparsity of the transmit beamforming vector in the first stage. Moreover, we propose a proximal iteratively reweighted algorithm to solve the log-sum minimization problem.

  • 3.

    For the presented proximal iteratively reweighted algorithm, we establish a global convergence analysis. We prove that every cluster point of the sequence generated by the presented algorithm satisfies the first-order necessary optimality condition of the original nonconvex log-sum problem. Furthermore, a worst-case O(1/t) convergence rate is established for this algorithm in an ergodic sense.

  • 4.

    Numerical experiments are conducted to demonstrate the effectiveness and competitive performance of the log-sum function based three-stage GSBF approach for designing the green edge AI inference system.

I-A Related Works

The study of inducing sparsity generally falls into the category of sparse optimization [27, 28]. In particular, sparse optimization, emerging as a powerful tool, has recently contributed to the effective design of wireless networks, e.g., group sparse beamforming for energy-efficient cloud radio access networks [29, 30], and sparse signal processing for Internet-of-Things (IoT) networks [31, 32]. To induce the group sparsity structure of the beamforming vector, the works [22, 23] adopted the mixed \ell_{1,2}-norm. As illustrated in [27], the mixed \ell_{1,q}-norms (q>1) can induce the group sparsity structure of the solution of interest; among them, the mixed \ell_{1,2}-norm and the \ell_{1,\infty}-norm [33] are commonly adopted. However, the sparsity induced by convex sparsity-inducing norms is often unsatisfactory, since there always exist some small nonzero elements in the obtained solutions [34]. In contrast, some works applied nonconvex sparsity-inducing functions to seek sparser solutions [35]. Notably, the work [34] reported the capability of the log-sum function to enhance the sparsity of the solutions.

Motivated by their superior performance in inducing sparsity, we adopt the log-sum function to promote the sparsity pattern of the solutions. However, adopting the log-sum function to enhance sparsity usually makes the problem difficult to compute and analyze. In [34], the authors first proposed an iteratively reweighted \ell_{1} algorithm (IRL1) for tackling the nonconvex and nonsmooth log-sum function with linear constraints. Nonetheless, they did not conduct a convergence analysis for the proposed method. Under reasonable assumptions, the work [36] established convergence results for a class of unconstrained nonconvex nonsmooth problems based on the limiting-subgradient tool; in particular, these results apply to the log-sum model in an unconstrained setting. The work [37] proposed a proximal iteratively reweighted algorithm and proved that any accumulation point is a critical point. The work [38] further showed that, for any feasible starting point, the sequence generated by their proximal iteratively reweighted algorithm converges to a critical point under the Kurdyka-Łojasiewicz (KL) property [39]. However, these works focused on unconstrained formulations or linearly constrained cases when the log-sum model is involved. The theoretical analysis of the log-sum function with general convex-set constraints has not been investigated.

I-B Organization

The remainder of this paper is organized as follows. Section II presents the system model of edge AI inference, followed by the problem formulation and analysis. Section II-C provides the group sparse beamforming formulation. The log-sum function based three-stage GSBF approach is proposed in Section III. Section IV provides the global convergence and convergence rate analysis of the proposed proximal iteratively reweighted algorithm. Section V demonstrates the performance of the proposed approach. Concluding remarks are made in Section VI. To keep the main text coherent and free of technical details, we defer most of the mathematical proofs to the appendices.

I-C Notation

Throughout this paper, we use the following notation. \mathbb{C}^{n} and \mathbb{R}^{n} denote the complex and real Euclidean n-spaces, respectively. Boldface lower-case and upper-case letters represent vectors (e.g., \bm{x}) and matrices (e.g., \bm{I}) of appropriate size, respectively. The inner product between \bm{x},\bm{y}\in\mathbb{C}^{n} is denoted as \langle\bm{x},\bm{y}\rangle. \|\cdot\|_{1} and \|\cdot\|_{2} are the conventional \ell_{1}-norm and \ell_{2}-norm of vectors in \mathbb{C}^{n}, respectively. In addition, (\cdot)^{\sf{H}} and (\cdot)^{\sf{T}} denote the Hermitian and transpose operators, respectively, and \mathfrak{R}(\cdot) is the real part of a complex scalar. \bm{1} is the all-ones vector and \bm{0} denotes the zero vector of appropriate size. In particular, \|\bm{v}\|_{\mathcal{G}_{2}}=\left[\|\bm{v}_{1}\|_{2},\cdots,\|\bm{v}_{n}\|_{2}\right]^{\sf{T}}\in\mathbb{R}^{n} represents the vector whose nth element is the \ell_{2}-norm of the subvector \bm{v}_{n}. We use \circ to denote the composition of two functions, and \odot denotes the elementwise product between two vectors \bm{x},\bm{y}\in\mathbb{C}^{n}.

For any closed convex set 𝒞n\mathcal{C}\subset\mathbb{C}^{n}, we use δ𝒞(𝒄)\delta_{\mathcal{C}}(\bm{c}) to denote the characteristic function associated with 𝒞\mathcal{C}, which is defined as

δ𝒞(𝒄):={0,𝒄𝒞,+,𝒄𝒞.\delta_{\mathcal{C}}(\bm{c}):=\left\{\begin{array}[]{lc}0,&\bm{c}\in\mathcal{C},\\ +\infty,&\bm{c}\notin\mathcal{C}.\end{array}\right.

Similarly, \mathbb{I}(\cdot) denotes an indicator function associated with a given condition, i.e., it returns the value 1 if the condition is met and the value 0 otherwise. Moreover, z\sim\mathcal{CN}(\mu,\sigma^{2}) denotes a complex Gaussian random variable z with mean \mu and variance \sigma^{2}.

II System Model and Problem Formulation

This section describes the overall system model and power consumption model for performing intelligent tasks in the considered edge AI inference system, followed by the problem formulation and analysis.

II-A System Model

We consider an edge computing system consisting of N L-antenna BSs collaboratively serving K single-antenna mobile users (MUs), as illustrated in Fig. 1. The deployed BSs serve as dedicated edge nodes with access to abundant computation and storage resources [8]. For convenience, define \mathcal{K}=\{1,\dots,K\} and \mathcal{N}=\{1,\dots,N\} as the index sets of MUs and BSs, respectively. Each MU has an inference task whose result can be inferred from a task-related DNN. For ease of exposition, we use \bm{d}_{k} to denote the raw input data collected from MU k, and the corresponding inference result is represented as \phi_{k}(\bm{d}_{k}). As performing intelligent tasks on DNNs is typically resource-demanding, it is usually impractical to perform the tasks locally on resource-constrained mobile devices. In the proposed edge AI inference system, by exploiting computation replication [25], we consider the scenario where each neighboring edge BS has collected the raw input data \{\bm{d}_{k}\} from all MUs. The edge BSs then process the data \{\bm{d}_{k}\} for model inference. After the edge BSs complete the model inference, the inference results \{\phi_{k}(\bm{d}_{k})\} are returned to the corresponding MUs via the downlink channels. We assume that all edge BSs are equipped with the pre-trained deep network models for all inference tasks [23].

In the downlink transmission, the edge BSs that perform the inference task of the same MU cooperatively return the inference result to that MU. We assume perfect channel state information (CSI) is available at all edge BSs to enable cooperative transmission of the inference results [24]. Let \mathcal{A}_{n}\subseteq\mathcal{K} denote the index set of MUs whose tasks are selected to be performed by BS n, and let \mathcal{A}=(\mathcal{A}_{1},\cdots,\mathcal{A}_{N}) represent the task selection strategy.

Figure 1: System model illustration of the edge AI inference for intelligent tasks. This paper considers the scenario that each neighboring edge BS has collected the raw input data {𝒅k}\{\bm{d}_{k}\} from mobile users.

II-A1 Downlink Transmission Model

Let s_{k}\in\mathbb{C} denote the encoded scalar of the requested output \phi_{k}(\bm{d}_{k}) for MU k, and let \bm{v}_{nk}\in\mathbb{C}^{L} be the transmit beamforming vector at BS n for s_{k}. For convenience, and without loss of generality, we assume that \mathbb{E}(|s_{k}|^{2})=1, i.e., the power of s_{k} is normalized to unity. The transmitted signal \bm{x}_{n} at BS n can be expressed as

𝒙n=k𝒜n𝒗nksk.\bm{x}_{n}=\sum_{k\in\mathcal{A}_{n}}\bm{v}_{nk}s_{k}. (1)

Let \bm{h}_{nk}\in\mathbb{C}^{L} be the propagation channel coefficient vector between BS n and MU k. The received signal at MU k, denoted as y_{k}, is then given by

yk\displaystyle y_{k} =n𝒩𝒉nk𝖧𝒙n+zk\displaystyle=\sum_{n\in\mathcal{N}}\bm{h}_{nk}^{\sf{H}}\bm{x}_{n}+z_{k} (2)
=n𝒩𝒉nk𝖧l𝒜n𝒗nlsl+zk\displaystyle=\sum_{n\in\mathcal{N}}\bm{h}_{nk}^{\sf{H}}\sum_{l\in\mathcal{A}_{n}}\bm{v}_{nl}s_{l}+z_{k}
=n𝒩𝒉nk𝖧[𝕀(k𝒜n)𝒗nksk+l𝒜n,lk𝒗nlsl]+zk,\displaystyle=\sum_{n\in\mathcal{N}}\bm{h}_{nk}^{\sf{H}}\!\left[\mathbb{I}(k\in\mathcal{A}_{n})\bm{v}_{nk}s_{k}\!+\!\sum_{l\in\mathcal{A}_{n},l\neq k}\!\bm{v}_{nl}s_{l}\right]+z_{k},

where zk𝒞𝒩(0,σk2)z_{k}\sim\mathcal{CN}(0,\sigma^{2}_{k}) is the isotropic additive white Gaussian noise.

We assume that all data symbols s_{k} are mutually independent of each other and of the noise. Based on (2), the signal-to-interference-plus-noise ratio (SINR) for MU k is therefore given as

SINRk(𝒜)=|n𝒩𝕀(k𝒜n)𝒉nk𝖧𝒗nk|2lk|n𝒩𝕀(l𝒜n)𝒉nk𝖧𝒗nl|2+σk2.\textrm{SINR}_{k}(\mathcal{A})=\frac{|\sum_{n\in\mathcal{N}}\mathbb{I}(k\in\mathcal{A}_{n})\bm{h}_{nk}^{\sf{H}}\bm{v}_{nk}|^{2}}{\sum_{l\neq k}|\sum_{n\in\mathcal{N}}\mathbb{I}(l\in\mathcal{A}_{n})\bm{h}_{nk}^{\sf{H}}\bm{v}_{nl}|^{2}+\sigma_{k}^{2}}. (3)

II-A2 Power Consumption Model

The computation and transmission power consumption for model inference is generally large. Energy efficiency is therefore of significant importance in the edge AI inference system design, and the overall network power consumed in computation and communication at the edge BSs becomes our main interest. Specifically, we express the total downlink transmission power of all edge BSs as

Ptrans(𝒜,{𝒗nk}):\displaystyle P_{\textrm{trans}}(\mathcal{A},\{\bm{v}_{nk}\}): =n=1N1ηn𝔼[k𝒜n𝒗nksk22]\displaystyle=\sum_{n=1}^{N}\frac{1}{\eta_{n}}\mathbb{E}\bigg{[}\sum_{k\in\mathcal{A}_{n}}\|\bm{v}_{nk}s_{k}\|_{2}^{2}\bigg{]} (4)
=n=1Nk𝒜n1ηn𝒗nk22,\displaystyle=\sum_{n=1}^{N}\sum_{k\in\mathcal{A}_{n}}\frac{1}{\eta_{n}}\|\bm{v}_{nk}\|_{2}^{2},

where ηn\eta_{n} is the radio frequency power amplifier efficiency coefficient of edge BS nn.

In addition to the downlink transmission power consumption, the power consumed in performing AI inference tasks should be taken into consideration as well, owing to the power-demanding nature of running DNNs. We use P_{nk}^{\sf{c}} to denote the computation power consumption of BS n in performing inference task \phi_{k}. Then the computation power consumed by all BSs is given by

Pcomp(𝒜):=n𝒩k𝒜nPnk𝖼.P_{\textrm{comp}}(\mathcal{A}):=\sum_{n\in\mathcal{N}}\sum_{k\in\mathcal{A}_{n}}P_{nk}^{\sf{c}}. (5)

To estimate the computation energy consumed in executing task \phi_{k}, the works [40, 41] stated that the inference energy consumption of a deep neural network layer mainly includes computation energy consumption and data movement energy consumption. We take GoogLeNet v1 [42] as a concrete example of the energy consumed by performing inference tasks. Specifically, we use GoogLeNet v1 to perform image classification tasks on the Eyeriss chip [43]. With the help of an energy estimation online tool [44], we are able to visualize the energy consumption breakdown of GoogLeNet v1, as illustrated in Fig. 2. We obtain the estimate of the computation power consumption by dividing the total energy consumption by the computation time. In particular, the computation time is determined by the total number of multiplication-and-accumulation (MAC) operations and the peak throughput of the Eyeriss chip.
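As a quick illustration of this estimate, a minimal sketch is given below; the energy, MAC-count, and throughput numbers are hypothetical placeholders rather than the measured values from [43, 44].

```python
# Hypothetical illustration of estimating P_nk^c as (inference energy) / (computation time),
# where computation time = (total MACs) / (peak MAC throughput). All numbers are placeholders.
total_energy_J = 1.0e-2        # estimated energy of one inference pass (joules)
total_macs = 1.4e9             # total multiplication-and-accumulation operations of the model
peak_throughput = 6.0e10       # peak MAC throughput of the accelerator (MACs per second)

compute_time_s = total_macs / peak_throughput       # inference latency on the chip
P_comp_watts = total_energy_J / compute_time_s      # estimated computation power P_nk^c
print(f"computation time ~ {compute_time_s * 1e3:.1f} ms, P^c ~ {P_comp_watts:.2f} W")
```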

Figure 2: The estimated energy consumption breakdown [44] of the GoogLeNet v1 to perform image classification tasks on the Eyeriss chip [43].

Therefore, the overall power consumption for edge AI inference, including transmission and computation power consumption, is calculated as

Poverall(𝒜,{𝒗nk})\displaystyle P_{\textrm{overall}}(\mathcal{A},\{\bm{v}_{nk}\}) =Ptrans(𝒜,{𝒗nk})+Pcomp(𝒜)\displaystyle\!=\!P_{\textrm{trans}}(\mathcal{A},\{\bm{v}_{nk}\})+P_{\textrm{comp}}(\mathcal{A}) (6)
=n𝒩k𝒜n1ηn𝒗nk22+n𝒩k𝒜nPnk𝖼.\displaystyle=\!\sum_{n\in\mathcal{N}}\!\sum_{k\in\mathcal{A}_{n}}\!\frac{1}{\eta_{n}}\|\bm{v}_{nk}\|_{2}^{2}\!+\!\!\sum_{n\in\mathcal{N}}\!\sum_{{k\in\mathcal{A}_{n}}}\!P_{nk}^{\sf{c}}.

II-B Problem Formulation and Analysis

Note that there is a fundamental trade-off between transmission and computation power consumption. To be specific, more edge BSs performing the same task for MUs can significantly reduce the transmission power by exploiting higher transmit beamforming gains. However, this inevitably increases the computation power consumption for performing inference tasks. Therefore, the goal of an energy-efficient edge inference system can be achieved by minimizing the overall network power consumption to reach a balance between these two parts of power consumption.

Let \{\gamma_{k}\} be the target SINRs for the MUs to reliably receive the AI inference results in the downlink. In our proposed energy-efficient edge AI inference system, the overall power minimization problem is thus formulated as

min𝒜,{𝒗nk}\displaystyle\min_{\begin{subarray}{c}\mathcal{A},\{\bm{v}_{nk}\}\end{subarray}}{} Poverall(𝒜,{𝒗nk})\displaystyle P_{\textrm{overall}}(\mathcal{A},\{\bm{v}_{nk}\}) (7)
s.t. SINRk(𝒜)γk,k𝒦,\displaystyle\text{SINR}_{k}(\mathcal{A})\geq\gamma_{k},~{}\forall\,k\in\mathcal{K},
k𝒦𝒗nk22Pn𝗆𝖺𝗑,n𝒩,\displaystyle\sum_{k\in\mathcal{K}}\|\bm{v}_{nk}\|_{2}^{2}\leq P_{n}^{\sf{max}},~{}\forall\,n\in\mathcal{N},

where Pn𝗆𝖺𝗑>0P_{n}^{\sf{max}}>0 denotes the maximum transmit power of edge BS nn.

Unfortunately, problem (7) turns out to be a mixed combinatorial optimization problem due to the presence of combinatorial variable 𝒜\mathcal{A}, which makes it computationally intractable [45]. On the other hand, the nonconvex SINR constraints also pose troublesome challenges for solving (7). To address these issues, we recast problem (7) into a tractable formulation by inducing the group sparsity of the beamforming vector in the following section.

II-C A Group Sparse Beamforming Representation Framework

One naive approach to cope with the combinatorial variable \mathcal{A} is exhaustive search. However, it is often computationally prohibitive owing to its exponential complexity. As a practical alternative, a critical observation is that the combinatorial variable \mathcal{A} can be eliminated by exploiting the inherent connection between task selection and the group sparsity structure of the beamforming vectors. Specifically, if edge BS n does not perform the inference task from MU k (i.e., k\notin\mathcal{A}_{n}), then it will not deliver the inference result \phi_{k}(\bm{d}_{k}) in the downlink transmission (i.e., \bm{v}_{nk}=\bm{0}). In other words, if k\notin\mathcal{A}_{n}, all coefficients in the beamforming vector \bm{v}_{nk} are zero simultaneously. Mathematically, we have \mathcal{A}_{n}=\{k\mid\bm{v}_{nk}\neq\bm{0},~k\in\mathcal{K}\} for all n\in\mathcal{N}, meaning that the task selection strategy \mathcal{A} is uniquely determined by the group sparsity structure of \{\bm{v}_{nk}\}. In this respect, the overall network power consumption in problem (7) can be rewritten as

Psparse({𝒗nk})=n=1Nk=1K1ηn𝒗nk22+n=1Nk=1K𝕀(𝒗nk𝟎)Pnk𝖼.P_{\textrm{sparse}}(\{\bm{v}_{nk}\})\!=\!\sum_{n=1}^{N}\!\sum_{k=1}^{K}\!\frac{1}{\eta_{n}}\|\bm{v}_{nk}\|_{2}^{2}\!+\!\sum_{n=1}^{N}\!\sum_{k=1}^{K}\mathbb{I}(\bm{v}_{nk}\neq\bm{0})P_{nk}^{\sf{c}}. (8)

By considering the sparsity structure in the beamforming vectors, the SINR expression (3) is transformed into

SINRk\displaystyle\text{SINR}_{k} =|n𝒩𝒉nk𝖧𝒗nk|2lk|n𝒩𝒉nk𝖧𝒗nl|2+σk2\displaystyle=\frac{|\sum_{n\in\mathcal{N}}\bm{h}_{nk}^{\sf{H}}\bm{v}_{nk}|^{2}}{\sum_{l\neq k}|\sum_{n\in\mathcal{N}}\bm{h}_{nk}^{\sf{H}}\bm{v}_{nl}|^{2}+\sigma^{2}_{k}} (9)
=|𝒉k𝖧𝒗k|2lk|𝒉k𝖧𝒗l|2+σk2,k𝒦,\displaystyle=\frac{|\bm{h}_{k}^{\sf{H}}\bm{v}_{k}|^{2}}{\sum_{l\neq k}|\bm{h}_{k}^{\sf{H}}\bm{v}_{l}|^{2}+\sigma^{2}_{k}},~{}\,\forall\,k\in\mathcal{K},

where 𝒉k=[𝒉1k𝖧,,𝒉Nk𝖧]𝖧\bm{h}_{k}=\left[\bm{h}_{1k}^{\sf{H}},\dots,\bm{h}_{Nk}^{\sf{H}}\right]^{\sf{H}} and 𝒗k=[𝒗1k𝖳,,𝒗Nk𝖳]𝖳\bm{v}_{k}\!=\!\left[\bm{v}_{1k}^{\sf{T}},\dots,\bm{v}_{Nk}^{\sf{T}}\right]^{\sf{T}} are the aggregated channel vector and downlink transmit beamforming vector for MU kk, respectively.
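To make the correspondence concrete, the following minimal numpy sketch recovers the task selection \mathcal{A} from the group-sparsity pattern of \{\bm{v}_{nk}\} and evaluates (8) and (9); the dimensions, values, and the (NL)\times K array layout are arbitrary illustrative assumptions, not prescribed by the paper.

```python
import numpy as np

N, K, L = 3, 2, 2                 # numbers of BSs, MUs, and antennas per BS (toy sizes)
rng = np.random.default_rng(0)
H = rng.standard_normal((N * L, K)) + 1j * rng.standard_normal((N * L, K))  # columns are h_k
V = rng.standard_normal((N * L, K)) + 1j * rng.standard_normal((N * L, K))  # columns are v_k
V[0:L, 1] = 0                     # e.g., BS 1 does not serve MU 2, so the whole group is zero
eta = np.full(N, 0.25)            # power amplifier efficiencies eta_n
P_c = np.full((N, K), 0.45)       # computation power P_nk^c (watts)
sigma2 = np.ones(K)               # noise variances sigma_k^2

# Task selection A_n = {k : ||v_nk||_2 != 0}, recovered from the group-sparsity pattern
group_norms = np.array([[np.linalg.norm(V[n * L:(n + 1) * L, k]) for k in range(K)]
                        for n in range(N)])
A = [np.flatnonzero(group_norms[n] > 1e-9) for n in range(N)]

# Objective (8): per-group transmit power plus computation power of the selected tasks
P_sparse = np.sum(group_norms ** 2 / eta[:, None]) + np.sum(P_c[group_norms > 1e-9])

# SINR (9) with aggregated channels h_k and aggregated beamformers v_k
sinr = np.array([np.abs(H[:, k].conj() @ V[:, k]) ** 2 /
                 (sum(np.abs(H[:, k].conj() @ V[:, l]) ** 2 for l in range(K) if l != k)
                  + sigma2[k]) for k in range(K)])
print(A, P_sparse, sinr)
```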

On the other hand, since an arbitrary phase rotation of the transmit beamforming vectors {𝒗nk}\{\bm{v}_{nk}\} does not affect the downlink SINR constraints and the objective function value, we can always find proper phases to equivalently transform the SINR constraints in (7) into convex second-order cone constraints [46]. We thus have the following convex-constrained sparse optimization framework for network power minimization

min{𝒗nk}\displaystyle\min_{\{\bm{v}_{nk}\}}{} Psparse({𝒗nk})\displaystyle P_{\textrm{sparse}}(\{\bm{v}_{nk}\}) (10)
s.t. k𝒦𝒗nk22Pn𝗆𝖺𝗑,n𝒩,\displaystyle\sum_{k\in\mathcal{K}}\|\bm{v}_{nk}\|_{2}^{2}\leq P_{n}^{\sf{max}},~{}\forall n\in\mathcal{N},
lk|𝒉k𝖧𝒗l|2+σk21γk(𝒉k𝖧𝒗k),k𝒦.\displaystyle\sqrt{\sum_{l\neq k}\!|\bm{h}_{k}^{\sf{H}}\bm{v}_{l}|^{2}+\!\sigma^{2}_{k}}\!\leq\frac{1}{\sqrt{\gamma_{k}}}\mathfrak{R}(\bm{h}_{k}^{\sf{H}}\bm{v}_{k}),~{}\forall k\in\mathcal{K}.

However, problem (10) is still nonconvex due to the indicator function in the objective. As presented in [29, Proposition 1], a weighted mixed \ell_{1,2}-norm can serve as the tightest convex surrogate of the objective in (10), i.e.,

P({𝒗nk})=2n=1Nk=1KPnk𝖼/ηn𝒗nk2.P(\{\bm{v}_{nk}\})=2\sum_{n=1}^{N}\sum_{k=1}^{K}\sqrt{P_{nk}^{\sf{c}}/\eta_{n}}\|\bm{v}_{nk}\|_{2}. (11)

In this paper, we instead propose to adopt a new group-sparsity-inducing function for inference task selection that enhances sparsity, thereby further reducing the network power consumption.

III A Log-sum Function Based Three-stage Group Sparse Beamforming Framework

In this section, we shall propose to adopt the log-sum function to enhance the group sparsity of the beamforming vector, followed by describing the log-sum function based three-stage GSBF approach. In particular, we propose a proximal iteratively reweighted algorithm to address the log-sum minimization problem in the first stage.

III-A Log-sum Function for Enhancing Group Sparsity

Let 𝒗=[𝒗11𝖳,,𝒗1K𝖳,,𝒗N1𝖳,,𝒗NK𝖳]𝖳LNK\bm{v}=\!\left[\bm{v}_{11}^{\sf{T}},\cdots,\bm{v}_{1K}^{\sf{T}},\cdots,\bm{v}_{N1}^{\sf{\sf{T}}},\cdots,\bm{v}_{NK}^{\sf{T}}\right]^{\sf{T}}\!\in\!\mathbb{C}^{LNK} denote the aggregated beamforming vector 𝒗\bm{v}. To promote the group sparsity for the beamforming vector 𝒗nk\bm{v}_{nk}, in this paper, we propose to use the following weighted nonconvex log-sum function as an approximation for the objective Psparse(𝒗)P_{\textrm{sparse}}(\bm{v})

Ω(𝒗):=n=1Nk=1Kρnklog(1+p𝒗nk2),\Omega(\bm{v}):=\sum_{n=1}^{N}\sum_{k=1}^{K}\rho_{nk}\log(1+p\|\bm{v}_{nk}\|_{2}), (12)

where ρnk=Pnk𝖼/ηn>0\rho_{nk}=\sqrt{P^{\sf{c}}_{nk}/\eta_{n}}>0 is a weight coefficient and p>0p>0 is a tunable parameter. The main motivation for adopting such a log-sum penalty among various types of sparsity-inducing functions [27] is based on the following considerations:

  • The mixed \ell_{1,2}-norm acts like an \ell_{1}-norm on the group norms of \bm{v} and therefore offers the tightest convex relaxation of the \ell_{0}-norm. In contrast to the mixed \ell_{1,2}-norm, it has been reported that the log-sum function can induce significantly sparser solutions than the conventional \ell_{1}-norm [27, 34].

  • From the perspective of algorithm design and theoretical analysis, the log-sum function brings more practicability owing to its coercivity and the boundedness of its first derivative.
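As a small self-contained sketch of how the weighted log-sum penalty (12) is evaluated on a group-structured beamformer (the (NL)\times K array layout of \bm{v} is an implementation assumption):

```python
import numpy as np

def log_sum_penalty(V, rho, p, L):
    """Weighted log-sum penalty (12): sum_{n,k} rho_nk * log(1 + p * ||v_nk||_2).

    V  : complex array of shape (N*L, K); rows n*L:(n+1)*L hold v_nk for BS n.
    rho: array of shape (N, K) with weights rho_nk = sqrt(P_nk^c / eta_n).
    """
    N, K = rho.shape
    return sum(rho[n, k] * np.log1p(p * np.linalg.norm(V[n * L:(n + 1) * L, k]))
               for n in range(N) for k in range(K))
```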

III-B A Log-sum Function Based Three-stage Group Sparse Beamforming Approach

We now present the proposed log-sum based three-stage GSBF framework. Specifically, the first stage solves the log-sum convex-constrained problem via the proposed proximal iteratively reweighted algorithm to obtain a solution \bm{v}_{\textrm{sparse}}; the second stage prioritizes the tasks based on the obtained solution \bm{v}_{\textrm{sparse}} and the system parameters, and then obtains the optimal task selection strategy \mathcal{A}^{*}; with \mathcal{A}^{*} fixed, the third stage refines \bm{v}. The details are as follows.

Stage 1: Log-sum Function Minimization. In this first stage, we obtain the group sparsity structure of beamformer 𝒗\bm{v} by solving the following nonconvex program

min𝒗Ω(𝒗)s.t.𝒗𝒞,\displaystyle\begin{aligned} \min_{\bm{v}}~{}\Omega(\bm{v})\quad\textrm{s.t.}\quad\bm{v}\in\mathcal{C},\end{aligned} (13)

where 𝒞\mathcal{C} denotes the convex-constraints in (10).

However, the nonconvex and nonsmooth objective in (13) and the presence of the convex constraints usually pose challenges in computation and analysis. Inspired by the work of [34], we can iteratively minimize the objective by solving a sequence of tractable convex subproblems. The main idea of our presented algorithm is to solve a well-constructed convex surrogate subproblem instead of directly solving the original nonconvex problem.

Let f_{p}(\bm{v}_{nk}):=\log(1+p\|\bm{v}_{nk}\|_{2}). First observe that f_{p}(\bm{v}_{nk}) is a composite function with \bm{z}(\bm{v}_{nk})=\|{\bm{v}}_{nk}\|_{2} convex and f_{p}(\bm{z})=\log(1+pz_{nk}) concave (hence nonconvex). At the i-th iterate \bm{v}_{nk}^{[i]}, for any feasible \tilde{\bm{v}}_{nk}, we have

fp(𝒛(𝒗~nk))\displaystyle f_{p}(\bm{z}(\tilde{\bm{v}}_{nk})) fp(𝒛(𝒗nk[i]))+𝒘[i],𝒛(𝒗~nk)𝒛(𝒗nk[i])\displaystyle\leq f_{p}(\bm{z}(\bm{v}_{nk}^{[i]}))+\langle\bm{w}^{[i]},\bm{z}(\tilde{\bm{v}}_{nk})-\bm{z}(\bm{v}_{nk}^{[i]})\rangle (14)
fp(𝒛(𝒗nk[i]))+𝒘[i],𝒛(𝒗~nk)𝒛(𝒗nk[i])\displaystyle\leq f_{p}(\bm{z}(\bm{v}_{nk}^{[i]}))+\langle\bm{w}^{[i]},\bm{z}(\tilde{\bm{v}}_{nk})-\bm{z}(\bm{v}_{nk}^{[i]})\rangle
+β2𝒗~nk𝒗nk[i]22,\displaystyle\quad+\frac{\beta}{2}\|\tilde{\bm{v}}_{nk}-\bm{v}_{nk}^{[i]}\|_{2}^{2},

where \bm{w}^{[i]}\in\partial(f_{p}(\bm{z}(\bm{v}_{nk}^{[i]}))) is the subgradient of f_{p}(\bm{z}) at \bm{z}=\bm{z}(\bm{v}_{nk}^{[i]}), \beta>0 is a prescribed proximity parameter, and the first inequality holds by the concavity of f_{p}(\cdot) (cf. (31) in Appendix B). Hence, a convex subproblem is derived as an approximation of f_{p}(\bm{v}_{nk}) at the current iterate \bm{v}_{nk}^{[i]}, which reads

min{𝒗nk}\displaystyle\min_{\{\bm{v}_{nk}\}} n=1Nk=1Kwnk[i]𝒗nk2+β2n=1Nk=1K𝒗nk𝒗nk[i]22\displaystyle\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}\|\bm{v}_{nk}\|_{2}+\frac{\beta}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\|\bm{v}_{nk}-\bm{v}_{nk}^{[i]}\|_{2}^{2} (15)
s.t. 𝒗𝒞\displaystyle\bm{v}\in\mathcal{C}

with weights

wnk[i]=ρnk(fp(𝒛(𝒗nk[i])))=pρnkp𝒗nk[i]2+1.w_{nk}^{[i]}=\rho_{nk}\cdot\partial(f_{p}(\bm{z}(\bm{v}_{nk}^{[i]})))=\frac{p\rho_{nk}}{{p\|\bm{v}_{nk}^{[i]}\|_{2}}+1}. (16)

As presented in [34], a smaller \|\bm{v}_{nk}^{[i]}\|_{2} yields a larger w_{nk}^{[i]}, which drives the nonzero components of \bm{v}_{nk} towards zero more aggressively. Overall, to enhance the group sparsity structure of the beamforming vector, the proposed proximal iteratively reweighted algorithm is summarized in Algorithm 1.

1:  Input: p>0p>0, β>0\beta>0, IterMax and 𝒗[0]\bm{v}^{[0]}.
2:  Initialization: 𝒘[0]=𝟏\bm{w}^{[0]}=\bm{1} and set i=0i=0.
3:  while not converged and IterMax not reached do
4:     (Solve the reweighted subproblem for 𝒗[i+1]\bm{v}^{[i+1]}) Calculate 𝒗[i+1]\bm{v}^{[i+1]} according to (15).
5:     (Reweighting) Update weight wnk[i]w_{nk}^{[i]} according to (16).
6:     Set ii+1i\leftarrow i+1.
7:  end while
8:  Output: 𝒗sparse=𝒗[i]\bm{v}_{\textrm{sparse}}=\bm{v}^{[i]}.
Algorithm 1 A Proximal Iteratively Reweighted Algorithm for Solving (13)
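A compact sketch of Algorithm 1 is given below, assuming the convex subproblem (15) is handed to a generic conic solver through cvxpy with complex variables; the problem sizes, data layout, and solver choice are illustrative assumptions, not part of the paper.

```python
import numpy as np
import cvxpy as cp

def constraint_set(V, H, gamma, sigma2, P_max, N, K, L):
    """Convex constraint set C of (10): per-BS power budgets and second-order-cone SINR constraints."""
    cons = [cp.sum_squares(V[n * L:(n + 1) * L, :]) <= P_max[n] for n in range(N)]
    for k in range(K):
        interference = [H[:, k].conj() @ V[:, l] for l in range(K) if l != k]
        cons.append(cp.norm(cp.hstack(interference + [np.sqrt(sigma2[k])]), 2)
                    <= cp.real(H[:, k].conj() @ V[:, k]) / np.sqrt(gamma[k]))
    return cons

def prox_irw(H, rho, gamma, sigma2, P_max, N, K, L, p=100.0, beta=0.1, max_iter=25, eps=1e-5):
    """Proximal iteratively reweighted algorithm (Algorithm 1) for the log-sum problem (13)."""
    w = np.ones((N, K))                                   # w^{[0]} = 1
    V_prev = np.zeros((N * L, K), dtype=complex)          # anchor for the proximal term; the analysis
                                                          # assumes a feasible v^{[0]}, zeros are a placeholder
    for _ in range(max_iter):
        V = cp.Variable((N * L, K), complex=True)
        group_norms = [[cp.norm(V[n * L:(n + 1) * L, k], 2) for k in range(K)] for n in range(N)]
        # Subproblem (15): reweighted group-norm term plus the proximal term
        obj = (sum(w[n, k] * group_norms[n][k] for n in range(N) for k in range(K))
               + (beta / 2) * cp.sum_squares(V - V_prev))
        cp.Problem(cp.Minimize(obj), constraint_set(V, H, gamma, sigma2, P_max, N, K, L)).solve()
        V_prev = V.value
        # Reweighting step (16): w_nk = p * rho_nk / (p * ||v_nk||_2 + 1)
        norms = np.array([[np.linalg.norm(V_prev[n * L:(n + 1) * L, k]) for k in range(K)]
                          for n in range(N)])
        w_new = p * rho / (p * norms + 1.0)
        if np.sum(np.abs(w_new - w)) <= eps:              # stopping rule (29)
            w = w_new
            break
        w = w_new
    return V_prev
```

In practice one may warm-start \bm{v}^{[0]} with any feasible beamformer (e.g., the solution of the mixed \ell_{1,2} relaxation), since the convergence analysis in Section IV assumes \bm{v}^{[0]}\in\mathcal{C}.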

Stage 2: Tasks Selection. In this second stage, an ordering guideline is applied to determine the priority of inference tasks, which is guided by the solution obtained in Stage 1. For ease of notation, let 𝒮={(n,k)n𝒩,k𝒦}\mathcal{S}=\{(n,k)\mid n\in\mathcal{N},k\in\mathcal{K}\} denote the set of all tasks. By considering the key system parameters (e.g., 𝒉nk\bm{h}_{nk}, Pnk𝖼P_{nk}^{\sf{c}} and ηn\eta_{n}), the priority of task (n,k)(n,k) is heuristically given as

θnk=𝒉nk22ηnPnk𝖼𝒗nk2.\theta_{nk}=\sqrt{\frac{\|\bm{h}_{nk}\|_{2}^{2}\eta_{n}}{{P}_{nk}^{\sf{c}}}}\|\bm{v}_{nk}\|_{2}. (17)

Intuitively, if edge BS n has a lower beamforming gain, lower power amplifier efficiency, and lower channel power gain, but a higher computation power consumption for MU k, then task (n,k) has a lower priority. A lower \theta_{nk} indicates that the task of MU k has lower priority and may not be performed by BS n. Thus, tasks are sorted according to (17) in descending order, i.e., \theta_{\pi_{1}}\geq\theta_{\pi_{2}}\geq\cdots\geq\theta_{\pi_{NK}}, where \pi denotes the permutation of task indices.

We then solve a sequence of convex feasibility detection problems to obtain task selection strategy 𝒜\mathcal{A},

find𝒗s.t.𝒗π(t)=𝟎,𝒗𝒞,\displaystyle\mathop{\textrm{find}}~{}\bm{v}\quad\textrm{s.t.}\quad\bm{v}_{\pi^{(t)}}=\bm{0},~{}\bm{v}\in\mathcal{C}, (18)

where \pi^{(t)}=\{\pi_{t+1},\cdots,\pi_{NK}\} and t increases from K to NK until (18) becomes feasible. Here \bm{v}_{\pi^{(t)}}=\bm{0} are convex constraints, meaning that all coefficients of \bm{v}_{nk} are zero for tasks (n,k)\in\pi^{(t)}. The support set of the beamformer \bm{v} is defined as \mathcal{T}(\bm{v})=\{(n,k)\mid\|\bm{v}_{nk}\|_{2}\neq 0,n\in\mathcal{N},k\in\mathcal{K}\}; the optimal task selection strategy \mathcal{A}^{*} can then be derived from \mathcal{T}(\bm{v})=\{\pi_{1},\dots,\pi_{t}\}.

Stage 3: Solution Refinement. At this point, we have determined the task selection for each BS. Then, fixing the obtained task index set, we solve the following convex program to refine the beamforming vectors

min𝒗\displaystyle\min_{\bm{v}} n=1Nk=1K1ηn𝒗nk22+n=1Nk𝒜Pnk𝖼\displaystyle\sum_{n=1}^{N}\sum_{k=1}^{K}\frac{1}{\eta_{n}}\|\bm{v}_{nk}\|_{2}^{2}+\sum_{n=1}^{N}\sum_{k\in\mathcal{A}^{*}}P_{nk}^{\sf{c}} (19)
s.t. 𝒗π(t)=𝟎,\displaystyle\bm{v}_{\pi^{(t)}}=\bm{0},
𝒗𝒞.\displaystyle\bm{v}\in\mathcal{C}.
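Stages 2 and 3 can be sketched as follows, reusing the constraint_set helper from the Algorithm 1 sketch above. For brevity, the sketch solves the refinement objective (19) directly at each candidate support, which subsumes the pure feasibility check (18); the data layout and helper names are the same illustrative assumptions as before.

```python
import numpy as np
import cvxpy as cp

def stages_two_three(V_sparse, H_local, H, P_c, eta, gamma, sigma2, P_max, N, K, L):
    """Stage 2 (task ordering (17) and feasibility search (18)) and Stage 3 (refinement (19)).

    V_sparse: Stage 1 solution, shape (N*L, K).  H_local[n, k]: per-link channel h_nk (length L).
    """
    # Priority (17): theta_nk = sqrt(||h_nk||^2 * eta_n / P_nk^c) * ||v_nk||_2, sorted descending
    theta = np.array([[np.sqrt(np.linalg.norm(H_local[n, k]) ** 2 * eta[n] / P_c[n, k])
                       * np.linalg.norm(V_sparse[n * L:(n + 1) * L, k])
                       for k in range(K)] for n in range(N)])
    order = [divmod(idx, K) for idx in np.argsort(-theta, axis=None)]   # [(n, k), ...] = pi_1, pi_2, ...

    for t in range(K, N * K + 1):                    # increase t until the support becomes feasible
        active, inactive = order[:t], order[t:]
        V = cp.Variable((N * L, K), complex=True)
        cons = constraint_set(V, H, gamma, sigma2, P_max, N, K, L)
        cons += [V[n * L:(n + 1) * L, k] == 0 for (n, k) in inactive]   # v_{pi^{(t)}} = 0
        # Objective (19): transmit power plus (constant) computation power of the selected tasks
        obj = (sum(cp.sum_squares(V[n * L:(n + 1) * L, k]) / eta[n]
                   for n in range(N) for k in range(K))
               + sum(P_c[n, k] for (n, k) in active))
        prob = cp.Problem(cp.Minimize(obj), cons)
        prob.solve()
        if prob.status in ("optimal", "optimal_inaccurate"):
            A = [[k for (m, k) in active if m == n] for n in range(N)]  # selection A* per BS
            return V.value, A
    return None, None
```

With these sketches, Algorithm 2 amounts to calling prox_irw for Stage 1 and then stages_two_three on its output.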

Overall, our proposed log-sum function based three-stage GSBF framework for solving (7) is summarized in Algorithm 2.

1:  Stage 1: Log-sum function minimization. Call Algorithm 1 to obtain 𝒗sparse\bm{v}_{\textrm{sparse}}.
2:  Stage 2: Tasks selection.
3:  Sort θnk\theta_{nk}’s in descending order by (17).
4:  Set t=Kt=K.
5:  while not feasible do
6:     (Determine the optimal task selection) Calculate (18) with π(t)={πt+1,,πNK}\pi^{(t)}=\{\pi_{t+1},\dots,\pi_{NK}\}.
7:     Set tt+1t\leftarrow t+1.
8:  end while
9:  Stage 3: Solution refinement.
10:  Refine {𝒗nk}\{\bm{v}_{nk}\} by solving (19).
11:  Obtain {𝒗nk}\{\bm{v}_{nk}\} and 𝒜\mathcal{A}^{*}.
Algorithm 2 A Log-sum Function Based Three-stage GSBF Framework for Solving (7)

IV Global Convergence Analysis

In this section, we provide the global convergence analysis for Algorithm 1. Specifically, we derive the first-order necessary optimality condition to characterize the optimal solutions. We then establish convergence results for a subsequence of the sequence generated by Algorithm 1. Furthermore, we show that for any initial feasible point, the entire sequence must have cluster points, and any cluster point satisfies the established first-order optimality condition. Finally, the ergodic worst-case convergence rate of the optimality residual is derived.

IV-A First-order Necessary Optimality Condition

In this subsection, we derive the first-order necessary conditions to characterize the optimal solution of (13). Problem (13) is equivalently rewritten as

min𝒗J(𝒗):=Ω(𝒗)+δ𝒞(𝒗).\min_{\bm{v}}~{}J(\bm{v}):=\Omega(\bm{v})+\delta_{\mathcal{C}}(\bm{v}). (20)

Similarly, for the derived subproblem (15), we have

min𝒗G(𝒗;𝒗[i]):=n=1Nk=1Kwnk[i]𝒗nk2+β2𝒗𝒗[i]22+δ𝒞(𝒗).\displaystyle\min_{\bm{v}}~{}\!G(\bm{v};\bm{v}^{[i]}):=\!\sum_{n=1}^{N}\!\sum_{k=1}^{K}\!w_{nk}^{[i]}\|\bm{v}_{nk}\|_{2}+\frac{\beta}{2}\|\bm{v}-\bm{v}^{[i]}\|_{2}^{2}\!+\!\delta_{\mathcal{C}}(\bm{v}). (21)

Due to the nonconvex and nonsmooth nature of the log-sum function, we make use of the Fréchet subdifferential as the major tool in our analysis. Its definition is introduced as follows.

Definition 1 (Fréchet subdifferential [26])

Let \mathcal{X} be a real Banach space, let \mathcal{X}^{*} denote the corresponding topological dual, and let f be a function from \mathcal{X} into the extended real line \bar{\mathbb{R}}=\mathbb{R}\cup\{+\infty\}, finite at \bm{r}. The set

Ff(𝒓)={𝒓𝒳|lim inf𝒖𝒓f(𝒖)f(𝒓)𝒓,𝒖𝒓𝒖𝒓20}\partial_{F}f(\bm{r})\!=\!\left\{\bm{r}^{*}\!\in\!\mathcal{X}^{*}|\liminf_{\bm{u}\to\bm{r}}\frac{f(\bm{u})-f(\bm{r})\!-\!\langle\bm{r}^{*},\bm{u}-\bm{r}\rangle}{\|\bm{u}\!-\!\bm{r}\|_{2}}\geq 0\right\}

is called a Fréchet subdifferential of ff at 𝐫\bm{r}. Its elements are referred to as Fréchet subgradients.

Several important properties of the Fréchet subdifferential [26] are listed below, which are used to characterize the optimal solution of (13).

Proposition 1

Let X be a closed and convex set. Then the following properties of Fréchet subdifferentials hold true.

  • (i)(i)

    If f is Fréchet subdifferentiable at \bm{x} and attains a local minimum at \bm{x}, then \bm{0}\in\partial_{F}f(\bm{x}).

  • (ii)(ii)

    Let h()h(\cdot) be Fréchet subdifferentiable at z¯=z(𝒙¯)\bar{z}=z(\bar{\bm{x}}) with z(𝒙)z(\bm{x}) being convex, then hz(𝒙)h\circ z(\bm{x}) is Fréchet subdifferentiable at 𝒙¯\bar{\bm{x}} such that

    y¯z(𝒙¯)Fhz(𝒙¯)\bar{y}\partial z(\bar{\bm{x}})\subset\partial_{F}h\circ z(\bar{\bm{x}})

    for any y¯Fh(z¯)\bar{y}\in\partial_{F}h(\bar{z}).

  • (iii)(iii)

    𝒩X(𝒙)=FδX(𝒙)\mathcal{N}_{X}(\bm{x})=\partial_{F}\delta_{X}(\bm{x}) with closed and convex sets XX.

The following Fermat’s rule [47] describes the necessary optimality condition of problem (13).

Theorem 1 (Fermat’s rule)

If (20) attains a local minimum at 𝐯\bm{v}, then it holds true that

𝟎FJ(𝒗):=FΩ(𝒗)+𝒩𝒞(𝒗).\bm{0}\in\partial_{F}J(\bm{v}):=\partial_{F}\Omega(\bm{v})+\mathcal{N}_{\mathcal{C}}(\bm{v}). (22)

We next investigate the properties of \partial_{F}\Omega(\bm{v}) in the following Proposition 2, which indicates that the Fréchet subdifferential of f_{p}(\cdot) at \bm{0} is bounded.

Proposition 2

If \bm{v}_{nk}\neq\bm{0}, then \gamma\partial\bm{z}(\bm{v}_{nk})\subset\partial_{F}f_{p}\circ\bm{z}(\bm{v}_{nk}) for any \gamma\in\partial_{F}f_{p}(\bm{z}). In particular, every element of \partial_{F}f_{p}\circ\bm{z}(\bm{0}) belongs to the set \{\bm{y}\in\mathbb{C}^{L}\mid\|\bm{y}\|_{2}\leq p\}.

To explore the behavior of the proposed proximal iteratively reweighted algorithm, based on Theorem 1 and Proposition 2, we define the optimality residual associated with (20) at a point 𝒗[i]\bm{v}^{[i]} as

𝒓[i]:=𝒘[i]𝒙[i]+𝒖[i],\bm{r}^{[i]}:=\bm{w}^{[i]}\odot\bm{x}^{[i]}+\bm{u}^{[i]}, (23)

where 𝒙[i]𝒗[i]𝒢2=[(𝒗11[i]2)𝖳,,(𝒗NK[i]2)𝖳]𝖳\bm{x}^{[i]}\in\partial\|\bm{v}^{[i]}\|_{\mathcal{G}_{2}}=\left[(\partial\|\bm{v}_{11}^{[i]}\|_{2})^{\sf{T}},\cdots,(\partial\|\bm{v}_{NK}^{[i]}\|_{2})^{\sf{T}}\right]^{\sf{T}} and 𝒖[i]𝒩𝒞(𝒗[i])\bm{u}^{[i]}\in\mathcal{N}_{\mathcal{C}}(\bm{v}^{[i]}). Since 𝒓[i]FJ(𝒗[i])\bm{r}^{[i]}\in\partial_{F}J(\bm{v}^{[i]}), it implies that if 𝒓[i]=𝟎\bm{r}^{[i]}=\bm{0} then 𝒗[i]\bm{v}^{[i]} satisfies the first-order necessary optimality condition (22). We adopt 𝒓[i]\bm{r}^{[i]} to measure the convergence rate of our algorithm.

Moreover, we provide the first-order optimality condition of the subproblem (21) as follows

𝟎=G(𝒗;𝒗[i])=β(𝒗[i+1]𝒗[i])+𝒘[i]𝒙[i+1]+𝒖[i+1],\displaystyle\bm{0}=\partial G(\bm{v};\bm{v}^{[i]})=\beta(\bm{v}^{[i+1]}-\bm{v}^{[i]})+\bm{w}^{[i]}\odot\bm{x}^{[i+1]}+\bm{u}^{[i+1]}, (24)

where \bm{v}^{[i+1]}=\operatornamewithlimits{arg\,min}_{\bm{v}}G(\bm{v};\bm{v}^{[i]}), \bm{x}^{[i+1]}\in\partial\|\bm{v}^{[i+1]}\|_{\mathcal{G}_{2}} and \bm{u}^{[i+1]}\in\mathcal{N}_{\mathcal{C}}(\bm{v}^{[i+1]}). Note that the existence of an optimal solution to (21) simply follows from the convexity and coercivity of the objective G(\cdot;\bm{v}^{[i]}).

Now we show that an optimal solution of (21) also satisfies the first-order necessary optimality condition of (20) in the following lemma.

Lemma 1

𝒗[i]\bm{v}^{[i]} satisfies the first-order necessary optimality condition of (20) if and only if

𝒗[i]=argmin𝒗G(𝒗;𝒗[i]).\bm{v}^{[i]}=\operatornamewithlimits{arg\,min}_{\bm{v}}G(\bm{v};\bm{v}^{[i]}).
Proof:

Please refer to Appendix A for details. ∎

Define the model reduction caused by 𝒗[i+1]\bm{v}^{[i+1]} at a point 𝒗[i]\bm{v}^{[i]} as

ΔG(𝒗[i+1];𝒗[i]):=G(𝒗[i];𝒗[i])G(𝒗[i+1];𝒗[i]).\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]}):=G(\bm{v}^{[i]};\bm{v}^{[i]})-G(\bm{v}^{[i+1]};\bm{v}^{[i]}). (25)

The new iterate \bm{v}^{[i+1]} causes a decrease in the objective J(\bm{v}), and the model reduction (25) converges to zero in the limit; both results are established in the following Lemma 2.

Lemma 2

Suppose \{\bm{v}^{[i]}\}_{i=0}^{\infty} is generated by Algorithm 1 with \bm{v}^{[0]}\in\mathcal{C}. The following statements hold true.

  • (i)(i)

    J(𝒗[i+1])J(𝒗[i])G(𝒗[i+1];𝒗[i])G(𝒗[i];𝒗[i])0J(\bm{v}^{[i+1]})-J(\bm{v}^{[i]})\leq G(\bm{v}^{[i+1]};\bm{v}^{[i]})-G(\bm{v}^{[i]};\bm{v}^{[i]})\leq 0.

  • (ii)(ii)

    \lim\limits_{i\to\infty}\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})=0.

  • (iii)(iii)

    G(𝒗;𝒗[i])G(\bm{v};\bm{v}^{[i]}) is monotonically decreasing. Indeed,

    ΔG(𝒗[i+1];𝒗[i])β2𝒗[i]𝒗[i+1]22.\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})\geq\frac{\beta}{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}.
Proof:

Please refer to Appendix B for details. ∎

We now provide our main result in the following Theorem 2.

Theorem 2

Suppose {𝐯[i]}i=0\{\bm{v}^{[i]}\}_{i=0}^{\infty} is generated by Algorithm 1 with 𝐯[0]𝒞\bm{v}^{[0]}\in\mathcal{C}. It holds true that {𝐯[i]}\{\bm{v}^{[i]}\} must be bounded and any cluster point of {𝐯[i]}\{\bm{v}^{[i]}\} satisfies the first-order necessary optimality condition of (20).

Proof:

Please refer to Appendix C for details. ∎

IV-B Ergodic Worst-case Convergence Rate

In this subsection, we show that the presented proximal iteratively reweighted algorithm has an O(1/t) ergodic worst-case convergence rate in terms of the optimality residual. The following lemma states that the optimality residual is upper bounded in terms of the displacement of the iterates.

Lemma 3

The optimality residual associated with problem (20) satisfies

𝒓[i+1]22(β2+2βκp2+κ2p4)𝒗[i]𝒗[i+1]22\|\bm{r}^{[i+1]}\|_{2}^{2}\leq(\beta^{2}+2\beta\kappa p^{2}+\kappa^{2}p^{4})\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}

with \kappa=\rho^{\max}, where \rho^{\max} denotes the maximum element among \left[\rho_{nk}\right] over all n\in\mathcal{N}, k\in\mathcal{K}.

Proof:

Please refer to Appendix D for details. ∎

The subproblem (21) is referred to as the primal problem, and by exploiting the conjugate function [47], the associated Fenchel-Rockafellar dual is constructed as

max𝝀,𝝁\displaystyle\max_{\bm{\lambda},\bm{\mu}} Q(𝝀,𝝁;𝒗[i])\displaystyle Q(\bm{\lambda},\bm{\mu};\bm{v}^{[i]}) (26)
s.t. 𝝀nk21,n𝒩,k𝒦,\displaystyle\|\bm{\lambda}_{nk}\|_{2}\leq 1,\quad\forall n\in\mathcal{N},k\in\mathcal{K},

where the dual objective is given as Q(\bm{\lambda},\bm{\mu};\bm{v}^{[i]})=-\frac{1}{2\beta}\|\bm{\lambda}\odot\bm{w}^{[i]}+\bm{\mu}-\beta\bm{v}^{[i]}\|_{2}^{2}+\frac{\beta}{2}\|\bm{v}^{[i]}\|_{2}^{2}-\delta_{\mathcal{C}}^{*}(\bm{\mu}), and the technical details for constructing (26) are provided in Appendix E.

The Fenchel-Rockafellar duality theorem [47] states that the optimal value of (26) provides a lower bound on the optimal value of (21). Moreover, the gap between the primal objective value of (21) and the corresponding dual objective value of (26) at the i-th iterate is defined as

g(𝒗,𝝀,𝝁;𝒗[i]):=G(𝒗;𝒗[i])Q(𝝀,𝝁;𝒗[i]).g(\bm{v},\bm{\lambda},\bm{\mu};\bm{v}^{[i]}):=G(\bm{v};\bm{v}^{[i]})-Q(\bm{\lambda},\bm{\mu};\bm{v}^{[i]}). (27)

If this gap g(𝒗,𝝀,𝝁;𝒗[i])g(\bm{v},\bm{\lambda},\bm{\mu};\bm{v}^{[i]}) is zero, then the strong duality holds. That is, at the optimal solution (𝒗[i+1];𝝀[i+1],𝝁[i+1])(\bm{v}^{[i+1]};\bm{\lambda}^{[i+1]},\bm{\mu}^{[i+1]}), we have

ΔG(𝒗[i+1];𝒗[i])=g(𝒗[i+1],𝝀[i+1],𝝁[i+1];𝒗[i]).\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})=g(\bm{v}^{[i+1]},\bm{\lambda}^{[i+1]},\bm{\mu}^{[i+1]};\bm{v}^{[i]}). (28)

We now show that the duality gap vanishes asymptotically in the following theorem.

Theorem 3

Let \{\bm{v}^{[i]}\}_{i=0}^{\infty} be the sequence generated by Algorithm 1 with \bm{v}^{[0]}\in\mathcal{C}. Then \|\bm{r}^{[i+1]}\|_{2}^{2} has an O(1/t) ergodic worst-case convergence rate.

Proof:

Please refer to Appendix F for details. ∎

V Numerical Experiments

In this section, we use numerical experiments to validate the effectiveness of our proposed algorithms and to illustrate the presented theoretical results. We compare the log-sum function based three-stage GSBF approach with the coordinated beamforming approach (CB) [48] and the mixed \ell_{1,2} GSBF approach [29] (Mixed \ell_{1,2} GSBF). These two baselines are described below:

  • CB considers minimizing the total transmit power consumption. In other words, all BSs are required to perform the inference tasks from all MUs.

  • Mixed \ell_{1,2} GSBF adopts the mixed \ell_{1,2}-norm (i.e., the objective function in (13) is replaced with \sum_{n=1}^{N}\sum_{k=1}^{K}\rho_{nk}\|\bm{v}_{nk}\|_{2}) to induce group sparsity of the beamforming vector in Stage 1 of Algorithm 2.

In the experimental set-up, we consider an edge AI inference system with 8 2-antenna BSs and 15 single-antenna MUs, all uniformly and independently distributed in a [-0.5,0.5] km \times [-0.5,0.5] km square region. The channel between BS n and MU k is set as \bm{h}_{nk}=10^{-L(d_{nk})/20}\bm{\xi}_{nk}, where the path-loss model is given by L(d_{nk})=128.1+37.6\log_{10}d_{nk}, d_{nk} is the Euclidean distance between BS n and MU k, and \bm{\xi}_{nk}\sim\mathcal{CN}(\bm{0},\bm{I}) is the small-scale fading coefficient. We set P^{\sf{c}}_{nk}=0.45 W and specify P_{n}^{\max}=1 W, \eta_{n}=25\% and \sigma_{k}^{2}=1. Furthermore, for the proposed log-sum function based three-stage GSBF approach, we set p=100, \beta=0.1 and initialize \bm{w}^{[0]}=\bm{1}. In particular, we terminate the proximal iteratively reweighted algorithm either when it reaches the predefined maximum number of iterations \textrm{IterMax}=25 or when it satisfies

𝒘[i+1]𝒘[i]1ϵ,\|\bm{w}^{[i+1]}-\bm{w}^{[i]}\|_{1}\leq\epsilon, (29)

where \epsilon=10^{-5} is a prescribed tolerance.
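A sketch of the channel and parameter generation used above might look as follows (numpy only; the random seed and the uniform placement routine are implementation assumptions).

```python
import numpy as np

rng = np.random.default_rng(2019)
N, L, K = 8, 2, 15                                   # 8 BSs with 2 antennas each, 15 single-antenna MUs

# Uniform, independent placement of BSs and MUs in the [-0.5, 0.5] km x [-0.5, 0.5] km region
bs_pos = rng.uniform(-0.5, 0.5, size=(N, 2))
mu_pos = rng.uniform(-0.5, 0.5, size=(K, 2))
d = np.linalg.norm(bs_pos[:, None, :] - mu_pos[None, :, :], axis=-1)   # distances d_nk in km

# Channel h_nk = 10^{-L(d_nk)/20} * xi_nk with L(d) = 128.1 + 37.6 log10(d) and xi_nk ~ CN(0, I)
path_loss_dB = 128.1 + 37.6 * np.log10(d)
xi = (rng.standard_normal((N, K, L)) + 1j * rng.standard_normal((N, K, L))) / np.sqrt(2)
H_local = 10 ** (-path_loss_dB[..., None] / 20) * xi                    # shape (N, K, L)
H = H_local.transpose(0, 2, 1).reshape(N * L, K)                        # aggregated h_k as columns

# System parameters used in Section V
P_c = np.full((N, K), 0.45)          # computation power P_nk^c [W]
P_max = np.ones(N)                   # per-BS transmit power budget [W]
eta = np.full(N, 0.25)               # power amplifier efficiency
sigma2 = np.ones(K)                  # noise variance
rho = np.sqrt(P_c / eta[:, None])    # weights rho_nk = sqrt(P_nk^c / eta_n)
gamma = np.full(K, 10 ** (4 / 10))   # e.g., a 4 dB target SINR for all MUs
```

These arrays can then be passed to the prox_irw and stages_two_three sketches given earlier.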

V-A Convergence of the Proximal Iteratively Reweighted Algorithm

The goal of this subsection is to illustrate the convergence behavior of the proposed proximal iteratively reweighted algorithm. The presented result is obtained for a typical channel realization. Fig. 3 illustrates the convergence of the proximal iteratively reweighted algorithm. We can see that \Omega(\bm{v}) steadily decreases along the iterations, which is consistent with our analysis in Lemma 2. Interestingly, we observe that the objective value \Omega(\bm{v}) drops quickly in the first few iterations (fewer than 5 iterations), which indicates that the proposed proximal iteratively reweighted algorithm converges very fast. In view of this, we may suggest terminating Algorithm 1 early in practice to obtain an approximate solution, thereby speeding up the entire framework while maintaining the overall performance.

Figure 3: Convergence of the proximal iteratively reweighted algorithm for log-sum minimization problem.

V-B Effectiveness of the Proposed Approach

We evaluate the performance of the three algorithms in terms of the overall network power consumption, the transmit power consumption and the number of computation tasks. The presented results are averaged over 100 randomly and independently generated channel realizations.

Fig. 4 depicts the overall network power consumption of the three approaches under different target SINRs. First, we observe that all three approaches have higher total power consumption as the required SINR becomes more stringent. This is because more edge BSs are required to transmit the inference results to meet the higher QoS. In addition, we can see that the CB approach has the highest power consumption among the three approaches, and the relative power difference between CB and the other two approaches reaches approximately 67% when the target SINR is 2 dB and approximately 18% when it is 8 dB, indicating the effectiveness of the joint task selection and group sparse beamforming approach in minimizing the overall network power consumption. On the other hand, we can see that the proposed log-sum function based three-stage GSBF approach outperforms the mixed \ell_{1,2} GSBF approach, which demonstrates that enhancing the group sparsity further reduces the overall network power consumption. In particular, we also observe that the performance gap between the blue and the red curves remains at approximately 9% when the target SINR ranges from 4 dB to 8 dB, which indicates that the proposed log-sum function based three-stage GSBF approach is still attractive in the high SINR regime.

Tables I and II further report the number of inference tasks performed by the edge BSs and the transmission power consumption, respectively. To be specific, in Table I, we observe that the number of performed inference tasks differs among the three approaches under various SINRs, which reflects the role of the task selection strategy. Besides, it is observed that the log-sum function based three-stage GSBF approach always performs fewer inference tasks than the mixed \ell_{1,2} GSBF approach for all target SINRs, which indicates that the log-sum function based three-stage GSBF approach enhances the group sparsity pattern of the beamforming vector. Meanwhile, as observed in Table II, the CB approach has the lowest transmission power among the three approaches because it only optimizes the transmission power while performing all inference tasks. On the other hand, the transmission power consumption of the log-sum function based three-stage GSBF approach is slightly higher than that of the mixed \ell_{1,2} GSBF approach under most SINRs. This is because more edge BSs participate in performing inference tasks in the mixed \ell_{1,2} GSBF approach, resulting in a higher transmit beamforming gain for reducing the transmission power. In other words, performing fewer inference tasks further reduces the computation power consumption of the edge BSs but increases the transmission power consumption. Taking Fig. 4 and Tables I-II together, we see that the proposed joint task selection and GSBF approach strikes a good balance between computation and transmission power consumption, yielding the lowest overall network power consumption.

Figure 4: Average total network power consumption comparison for three different approaches in edge AI inference system.
TABLE I: The Number of Performed Inference Tasks with Different Approaches
Target SINR [dB] | Proposed | Mixed \ell_{1,2} GSBF | CB
0 | 16.49 | 17.79 | 120.00
2 | 21.00 | 22.50 | 120.00
4 | 30.08 | 34.47 | 120.00
6 | 43.57 | 52.17 | 120.00
8 | 67.76 | 77.36 | 120.00
TABLE II: The Total Transmit Power Consumption with Different Approaches
Target SINR [dB] | Proposed | Mixed \ell_{1,2} GSBF | CB
0 | 4.16 | 4.68 | 0.41
2 | 9.12 | 8.31 | 0.88
4 | 12.24 | 11.76 | 1.99
6 | 15.81 | 14.70 | 4.57
8 | 18.81 | 18.58 | 10.79

VI Conclusion

In this paper, we developed an energy-efficient edge AI inference system through the joint selection of inference tasks and optimization of the transmit beamforming vectors, which respectively determine the computation power consumption and the downlink transmission power consumption. Based on the critical insight that inference task selection can be achieved by controlling the group sparsity structure of the transmit beamforming vectors, we developed a group sparse optimization framework for network power minimization, for which a log-sum function based three-stage group sparse beamforming algorithm was developed to enhance the group sparsity of the solutions. To resolve the resulting nonconvex and nonsmooth log-sum function minimization problem, we further proposed a proximal iteratively reweighted algorithm. Furthermore, a global convergence analysis was provided, and a worst-case O(1/t) convergence rate in an ergodic sense was derived for this algorithm.

Appendix A Proof of Lemma 1

Let $\bm{x}^{[i]}\in\partial\|\bm{v}^{[i]}\|_{\mathcal{G}_{2}}$ and $\bm{u}^{[i]}\in\mathcal{N}_{\mathcal{C}}(\bm{v}^{[i]})$. If $\bm{v}^{[i]}=\operatornamewithlimits{arg\,min}_{\bm{v}}G(\bm{v};\bm{v}^{[i]})$, then by (24) we have

\bm{0}=\bm{w}^{[i]}\odot\bm{x}^{[i]}+\bm{u}^{[i]}. (30)

We conclude that $\bm{v}^{[i]}$ satisfies (22), indicating that $\bm{v}^{[i]}$ is first-order optimal for (20).

Conversely, if $\bm{v}^{[i]}$ satisfies (22), then $\bm{v}^{[i]}$ satisfies (30) by Proposition 2. Thus $\bm{v}^{[i]}$ must be the optimal solution to the subproblem (15). This completes the proof.

Appendix B Proof of Lemma 2

First of all, $\bm{v}^{[i+1]}=\operatornamewithlimits{arg\,min}_{\bm{v}}G(\bm{v};\bm{v}^{[i]})$ and $G(\bm{v};\bm{v}^{[i]})$ is convex, so that $G(\bm{v}^{[i+1]};\bm{v}^{[i]})-G(\bm{v}^{[i]};\bm{v}^{[i]})\leq 0$. Since $f_{p}(\cdot)$ is concave, we have

f_{p}(\bm{z})\leq f_{p}(\bm{z}_{0})+\langle\nabla f_{p}(\bm{z}_{0}),\bm{z}-\bm{z}_{0}\rangle,~\forall\bm{z},\bm{z}_{0}\in\mathbb{R}_{+}^{L}. (31)

Therefore

\begin{aligned} J(\bm{v}^{[i+1]})&=\Omega(\bm{v}^{[i+1]})\\ &\leq\Omega(\bm{v}^{[i]})+\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}(\|\bm{v}_{nk}^{[i+1]}\|_{2}-\|\bm{v}_{nk}^{[i]}\|_{2})\\ &\quad+\frac{\beta}{2}\|\bm{v}^{[i+1]}-\bm{v}^{[i]}\|_{2}^{2}\\ &=J(\bm{v}^{[i]})+G(\bm{v}^{[i+1]};\bm{v}^{[i]})-G(\bm{v}^{[i]};\bm{v}^{[i]}),\end{aligned} (32)

where the first inequality follows from (31). This completes the proof of the first statement $(i)$.

On the other hand, by (32), we have

\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})\leq J(\bm{v}^{[i]})-J(\bm{v}^{[i+1]}). (33)

Summing both sides of (33) over $i=0,\ldots,t$ yields

\begin{aligned} 0\leq\sum_{i=0}^{t}\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})&\leq J(\bm{v}^{[0]})-J(\bm{v}^{[t+1]})\\ &\leq J(\bm{v}^{[0]})-\tilde{J},\end{aligned} (34)

where $\tilde{J}$ is a lower bound of $J(\cdot)$. Since every summand is nonnegative, letting $t\to\infty$ shows the series converges, and hence

\lim_{i\to\infty}\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})=0. (35)

This completes the proof of the second statement $(ii)$.

For the last statement $(iii)$, we follow the proof technique presented in [37]. By (24), we have

\bm{0}=\beta(\bm{v}^{[i+1]}-\bm{v}^{[i]})+\widetilde{\bm{x}}^{[i+1]}+\bm{u}^{[i+1]}, (36)

where $\widetilde{\bm{x}}^{[i+1]}\in\partial\langle\bm{w}^{[i]},\|\bm{v}^{[i+1]}\|_{\mathcal{G}_{2}}\rangle$ and $\bm{u}^{[i+1]}\in\mathcal{N}_{\mathcal{C}}(\bm{v}^{[i+1]})$. Taking the inner product of both sides of (36) with $(\bm{v}^{[i+1]}-\bm{v}^{[i]})$ yields

\begin{aligned} \bm{0}=&\beta\|\bm{v}^{[i+1]}-\bm{v}^{[i]}\|_{2}^{2}+\langle\widetilde{\bm{x}}^{[i+1]},\bm{v}^{[i+1]}-\bm{v}^{[i]}\rangle\\ &+\langle\bm{u}^{[i+1]},\bm{v}^{[i+1]}-\bm{v}^{[i]}\rangle.\end{aligned} (37)

By the definition of the subgradient of a convex function, we have

\begin{aligned} \langle\bm{w}^{[i]},\|\bm{v}^{[i]}\|_{\mathcal{G}_{2}}-\|\bm{v}^{[i+1]}\|_{\mathcal{G}_{2}}\rangle&\geq\langle\widetilde{\bm{x}}^{[i+1]},\bm{v}^{[i]}-\bm{v}^{[i+1]}\rangle\\ &=\langle\bm{u}^{[i+1]},\bm{v}^{[i+1]}-\bm{v}^{[i]}\rangle\\ &\quad+\beta\|\bm{v}^{[i+1]}-\bm{v}^{[i]}\|_{2}^{2}.\end{aligned} (38)

Therefore

\begin{aligned} &\quad\,\,G(\bm{v}^{[i]};\bm{v}^{[i]})-G(\bm{v}^{[i+1]};\bm{v}^{[i]})\\ &=\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}(\|\bm{v}_{nk}^{[i]}\|_{2}-\|\bm{v}_{nk}^{[i+1]}\|_{2})-\frac{\beta}{2}\|\bm{v}^{[i+1]}-\bm{v}^{[i]}\|_{2}^{2}\\ &=\langle\bm{w}^{[i]},\|\bm{v}^{[i]}\|_{\mathcal{G}_{2}}-\|\bm{v}^{[i+1]}\|_{\mathcal{G}_{2}}\rangle-\frac{\beta}{2}\|\bm{v}^{[i+1]}-\bm{v}^{[i]}\|_{2}^{2}\\ &\geq\langle\bm{u}^{[i+1]},\bm{v}^{[i+1]}-\bm{v}^{[i]}\rangle+\frac{\beta}{2}\|\bm{v}^{[i+1]}-\bm{v}^{[i]}\|_{2}^{2}\\ &=\frac{\beta}{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}-\langle\bm{u}^{[i+1]},\bm{v}^{[i]}-\bm{v}^{[i+1]}\rangle\\ &\geq\frac{\beta}{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2},\end{aligned} (39)

where the first inequality follows from (38) and the last inequality holds since $\bm{u}^{[i+1]}\in\mathcal{N}_{\mathcal{C}}(\bm{v}^{[i+1]})$. This completes the proof.
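
To complement the proof, the following toy numerical check illustrates statements $(i)$ and $(iii)$ of Lemma 2. It is a minimal sketch under simplifying assumptions not made in the paper: the constraint set $\mathcal{C}$ is dropped, so the subproblem reduces to block soft-thresholding, and the log-sum surrogate is instantiated as $f_{p}(\bm{z})=\sum_{g}\log(1+z_{g}/p)$ (one common choice; the paper's exact parameterization may differ). The group sizes, $p$, and $\beta$ are arbitrary.

import numpy as np

# Toy check of Lemma 2 (sketch only): unconstrained variant with an assumed
# log-sum surrogate, so the subproblem has a closed-form block soft-threshold.
rng = np.random.default_rng(0)
p, beta = 0.1, 2.0
groups = [rng.standard_normal(4) for _ in range(6)]   # v split into 6 groups

def J(gs):                       # objective: log-sum of group norms
    return sum(np.log(1.0 + np.linalg.norm(g) / p) for g in gs)

def G(gs, gs_i, w):              # majorizing subproblem objective G(v; v^[i])
    lin = sum(wg * np.linalg.norm(g) for wg, g in zip(w, gs))
    prox = 0.5 * beta * sum(np.linalg.norm(g - gi) ** 2 for g, gi in zip(gs, gs_i))
    return lin + prox

v = [g.copy() for g in groups]
for it in range(10):
    w = [1.0 / (p + np.linalg.norm(g)) for g in v]     # reweighting step
    v_next = []
    for wg, g in zip(w, v):                            # block soft-thresholding
        norm_g = np.linalg.norm(g)
        shrink = max(0.0, 1.0 - wg / (beta * norm_g)) if norm_g > 0 else 0.0
        v_next.append(shrink * g)
    decrease = G(v, v, w) - G(v_next, v, w)
    dist2 = sum(np.linalg.norm(a - b) ** 2 for a, b in zip(v, v_next))
    # Lemma 2(i): monotone descent; Lemma 2(iii): sufficient decrease of G.
    assert J(v_next) <= J(v) + 1e-12
    assert decrease >= 0.5 * beta * dist2 - 1e-12
    v = v_next
print("Lemma 2 descent and sufficient-decrease checks passed on the toy problem.")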

Appendix C Proof of Theorem 2

By Lemma 2, the sequence $\{J(\bm{v}^{[i]})\}$ is monotonically decreasing. Since $J(\cdot)$ is coercive, the sequence $\{\bm{v}^{[i]}\}$ is bounded and therefore has cluster points. Let $\bm{v}^{*}$ be a cluster point of $\{\bm{v}^{[i]}\}$. From Lemma 1, it suffices to show that $\bm{v}^{*}=\operatornamewithlimits{arg\,min}_{\bm{v}}G(\bm{v};\bm{v}^{*})$. We prove this by contradiction. Assume that there exists a point $\tilde{\bm{v}}$ such that $\epsilon:=G(\bm{v}^{*};\bm{v}^{*})-G(\tilde{\bm{v}};\bm{v}^{*})>0$. Suppose $\{\bm{v}^{[i]}\}_{\mathcal{S}}\to\bm{v}^{*}$ with $\mathcal{S}\subset\mathbb{N}$. Based on Lemma 2, there must exist $i_{1}>0$ such that for all $i>i_{1}$, $i\in\mathcal{S}$,

G(\bm{v}^{[i]};\bm{v}^{[i]})-G(\bm{v}^{[i+1]};\bm{v}^{[i]})\leq\epsilon/4. (40)

Since $\bm{v}_{nk}^{[i]}\stackrel{\mathcal{S}}{\to}\bm{v}_{nk}^{*}$ and $w_{nk}^{[i]}\stackrel{\mathcal{S}}{\to}w_{nk}^{*}$, there exists $i_{2}$ such that for all $i>i_{2}$, $i\in\mathcal{S}$, $\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}\|\bm{v}_{nk}^{[i]}\|_{2}-w_{nk}^{*}\|\bm{v}_{nk}^{*}\|_{2}>-\epsilon/10$ and $\frac{\beta}{2}\|\bm{v}^{*}-\tilde{\bm{v}}\|_{2}^{2}+\sum_{n,k}(w_{nk}^{*}-w_{nk}^{[i]})\|\tilde{\bm{v}}_{nk}\|_{2}-\frac{\beta}{2}\|\tilde{\bm{v}}-\bm{v}^{[i]}\|_{2}^{2}>-\epsilon/10$.

Therefore, for all $i>i_{2}$, $i\in\mathcal{S}$,

\begin{aligned} &\quad\,\,G(\bm{v}^{*};\bm{v}^{*})-G(\tilde{\bm{v}};\bm{v}^{[i]})\\ &=\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{*}\|\bm{v}_{nk}^{*}\|_{2}-\frac{\beta}{2}\|\tilde{\bm{v}}-\bm{v}^{[i]}\|_{2}^{2}\\ &\quad-\sum_{n=1}^{N}\sum_{k=1}^{K}(w_{nk}^{*}-(w_{nk}^{*}-w_{nk}^{[i]}))\|\tilde{\bm{v}}_{nk}\|_{2}\\ &=\left[G(\bm{v}^{*};\bm{v}^{*})-G(\tilde{\bm{v}};\bm{v}^{*})\right]+\frac{\beta}{2}\|\tilde{\bm{v}}-\bm{v}^{*}\|_{2}^{2}\\ &\quad+\sum_{n=1}^{N}\sum_{k=1}^{K}(w_{nk}^{*}-w_{nk}^{[i]})\|\tilde{\bm{v}}_{nk}\|_{2}-\frac{\beta}{2}\|\tilde{\bm{v}}-\bm{v}^{[i]}\|_{2}^{2}\\ &\geq\epsilon-\epsilon/10=9\epsilon/10,\end{aligned} (41)

and that

\begin{aligned} &\quad\,\,G(\bm{v}^{[i]};\bm{v}^{[i]})-G(\bm{v}^{*};\bm{v}^{*})\\ &=\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}\|\bm{v}_{nk}^{[i]}\|_{2}-\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{*}\|\bm{v}_{nk}^{*}\|_{2}>-\epsilon/10.\end{aligned} (42)

Hence, for all $i>\max(i_{1},i_{2})$, $i\in\mathcal{S}$, it holds that

\begin{aligned} &\quad\,\,G(\bm{v}^{[i]};\bm{v}^{[i]})-G(\tilde{\bm{v}};\bm{v}^{[i]})\\ &=G(\bm{v}^{[i]};\bm{v}^{[i]})-G(\bm{v}^{*};\bm{v}^{*})+G(\bm{v}^{*};\bm{v}^{*})-G(\tilde{\bm{v}};\bm{v}^{[i]})\\ &\geq-\epsilon/10+9\epsilon/10=4\epsilon/5,\end{aligned} (43)

contradicting (40). Hence, $\bm{v}^{*}=\operatornamewithlimits{arg\,min}_{\bm{v}}G(\bm{v};\bm{v}^{*})$. By Lemma 1, $\bm{v}^{*}$ satisfies the first-order necessary optimality condition for (20). This completes the proof.

Appendix D Proof of Lemma 3

Recall the first-order necessary optimality condition (24) of subproblem (15). Rearranging terms, we have

\bm{v}^{[i]}-\bm{v}^{[i+1]}=\frac{1}{\beta}(\bm{w}^{[i]}\odot\bm{x}^{[i+1]}+\bm{u}^{[i+1]}), (44)

where $\bm{x}^{[i+1]}\in\partial\|\bm{v}^{[i+1]}\|_{\mathcal{G}_{2}}$ and $\bm{u}^{[i+1]}\in\mathcal{N}_{\mathcal{C}}(\bm{v}^{[i+1]})$. Squaring both sides of (44), we have

\begin{aligned} &\quad\,\,\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}\\ &=\frac{1}{\beta^{2}}\|\bm{w}^{[i]}\odot\bm{x}^{[i+1]}+\bm{u}^{[i+1]}\|_{2}^{2}\\ &=\frac{1}{\beta^{2}}\|(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]}+\bm{r}^{[i+1]}\|_{2}^{2}\\ &=\frac{1}{\beta^{2}}\|(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]}\|_{2}^{2}+\frac{1}{\beta^{2}}\|\bm{r}^{[i+1]}\|_{2}^{2}\\ &\quad+\frac{2}{\beta^{2}}\langle(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]},\bm{r}^{[i+1]}\rangle\\ &=\frac{1}{\beta^{2}}\|(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]}\|_{2}^{2}+\frac{1}{\beta^{2}}\|\bm{r}^{[i+1]}\|_{2}^{2}\\ &\quad-\frac{2}{\beta^{2}}\langle(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]},\\ &\quad(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]}+\beta(\bm{v}^{[i+1]}-\bm{v}^{[i]})\rangle\\ &=-\frac{1}{\beta^{2}}\|(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]}\|_{2}^{2}+\frac{1}{\beta^{2}}\|\bm{r}^{[i+1]}\|_{2}^{2}\\ &\quad-\frac{2}{\beta}\langle(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]},\bm{v}^{[i+1]}-\bm{v}^{[i]}\rangle,\end{aligned} (45)

where we use $\bm{u}^{[i+1]}=\beta(\bm{v}^{[i]}-\bm{v}^{[i+1]})-\bm{w}^{[i]}\odot\bm{x}^{[i+1]}$ to obtain the fourth equality. Then we have

\begin{aligned} &\quad\,\,\|\bm{r}^{[i+1]}\|_{2}^{2}\\ &=\beta^{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}+\|(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]}\|_{2}^{2}\\ &\quad+2\beta\langle(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]},\bm{v}^{[i+1]}-\bm{v}^{[i]}\rangle\\ &\leq\beta^{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}+\|\bm{w}^{[i]}-\bm{w}^{[i+1]}\|_{2}^{2}\\ &\quad+2\beta\|(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]}\|_{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}\\ &\leq\beta^{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}+\|\bm{w}^{[i]}-\bm{w}^{[i+1]}\|_{2}^{2}\\ &\quad+2\beta\|\bm{w}^{[i]}-\bm{w}^{[i+1]}\|_{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}\\ &\leq\beta^{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}+p^{4}\kappa^{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}\\ &\quad+2\beta p^{2}\kappa\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}\\ &=(\beta^{2}+2\beta\kappa p^{2}+\kappa^{2}p^{4})\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2},\end{aligned} (46)

where the last inequality holds because $f^{\prime}_{p}(\cdot)$ is $p^{2}$-Lipschitz continuous and $\kappa=\rho^{\max}$. This completes the proof.
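
Since the constant in (46) is a perfect square, $\beta^{2}+2\beta\kappa p^{2}+\kappa^{2}p^{4}=(\beta+\kappa p^{2})^{2}$, the bound just established can equivalently be written in the more compact form

\|\bm{r}^{[i+1]}\|_{2}\leq(\beta+\kappa p^{2})\,\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2},

so the residual is controlled linearly by the per-iteration displacement.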

Appendix E Construction of Fenchel-Rockafellar Dual (26)

The construction of the dual program relies on conjugate functions [47]. We first rewrite the objective $G(\bm{v};\bm{v}^{[i]})$ using conjugate functions as

\begin{aligned} G(\bm{v};\bm{v}^{[i]})&=\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}[\sup_{\bm{\lambda}_{nk}}\langle\bm{\lambda}_{nk},\bm{v}_{nk}\rangle-\delta_{\mathbb{B}}(\bm{\lambda}_{nk})]\\ &\quad+\frac{\beta}{2}\|\bm{v}-\bm{v}^{[i]}\|_{2}^{2}+\sup_{\bm{\mu}}[\langle\bm{\mu},\bm{v}\rangle-\delta_{\mathcal{C}}^{*}(\bm{\mu})],\end{aligned} (47)

where $\mathbb{B}:=\{\bm{y}\in\mathbb{C}^{L}\mid\|\bm{y}\|_{2}\leq 1\}$ denotes the unit ball of the dual norm of $\|\cdot\|_{2}$. Then the primal subproblem (21) can be written as

\begin{aligned} \inf_{\bm{v}}G(\bm{v};\bm{v}^{[i]})=\inf_{\bm{v}}\sup_{\bm{\lambda},\bm{\mu}}~&\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}[\langle\bm{\lambda}_{nk},\bm{v}_{nk}\rangle-\delta_{\mathbb{B}}(\bm{\lambda}_{nk})]\\ &+\frac{\beta}{2}\|\bm{v}-\bm{v}^{[i]}\|_{2}^{2}+\langle\bm{\mu},\bm{v}\rangle-\delta_{\mathcal{C}}^{*}(\bm{\mu}).\end{aligned} (48)

Swapping the order of $\inf$ and $\sup$ gives the corresponding Lagrangian dual, i.e.,

\begin{aligned} \sup_{\bm{\lambda},\bm{\mu}}\inf_{\bm{v}}~&\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}[\langle\bm{\lambda}_{nk},\bm{v}_{nk}\rangle-\delta_{\mathbb{B}}(\bm{\lambda}_{nk})]\\ &+\frac{\beta}{2}\|\bm{v}-\bm{v}^{[i]}\|_{2}^{2}+\langle\bm{\mu},\bm{v}\rangle-\delta_{\mathcal{C}}^{*}(\bm{\mu}).\end{aligned} (49)

The dual objective at $\bm{v}^{[i]}$ is then given by

\begin{aligned} Q(\bm{\lambda},\bm{\mu};\bm{v}^{[i]})=\inf_{\bm{v}}~&\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}[\langle\bm{\lambda}_{nk},\bm{v}_{nk}\rangle-\delta_{\mathbb{B}}(\bm{\lambda}_{nk})]\\ &+\frac{\beta}{2}\|\bm{v}-\bm{v}^{[i]}\|_{2}^{2}+\langle\bm{\mu},\bm{v}\rangle-\delta_{\mathcal{C}}^{*}(\bm{\mu}).\end{aligned} (50)

By exploiting the first-order necessary optimality condition of (50), we have

\begin{aligned} \sup_{\bm{\lambda},\bm{\mu}}~&-\frac{1}{2\beta}\|\bm{\lambda}\odot\bm{w}^{[i]}+\bm{\mu}-\beta\bm{v}^{[i]}\|_{2}^{2}+\frac{\beta}{2}\|\bm{v}^{[i]}\|_{2}^{2}-\delta_{\mathcal{C}}^{*}(\bm{\mu})\\ &-\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}\delta_{\mathbb{B}}(\bm{\lambda}_{nk}),\end{aligned} (51)

which is consistent with the Fenchel-Rockafellar dual presented in (26). This completes the construction.
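
As a sanity check of this construction, the sketch below verifies strong duality numerically for the subproblem under the simplifying assumption that the constraint set $\mathcal{C}$ is the whole space, so $\bm{\mu}$ can be taken as $\bm{0}$ and $\delta_{\mathcal{C}}^{*}$ vanishes. The groups, weights, and $\beta$ are arbitrary toy values, not drawn from the paper's setting.

import numpy as np

# Toy check of strong duality for (48)/(51) with C equal to the whole space.
rng = np.random.default_rng(1)
beta = 1.5
v_i = [rng.standard_normal(3) for _ in range(5)]        # groups of v^[i]
w = [abs(rng.standard_normal()) + 0.1 for _ in v_i]     # positive weights w^[i]

# Primal: closed-form block soft-thresholding solution of
#   min_v  sum_g w_g ||v_g||_2 + (beta/2) ||v - v^[i]||_2^2.
primal_val = 0.0
for wg, vg in zip(w, v_i):
    n = np.linalg.norm(vg)
    vg_opt = max(0.0, 1.0 - wg / (beta * n)) * vg if n > 0 else np.zeros_like(vg)
    primal_val += wg * np.linalg.norm(vg_opt) + 0.5 * beta * np.linalg.norm(vg_opt - vg) ** 2

# Dual: maximize (51) with mu = 0; the maximizer projects beta*v_g^[i]/w_g
# onto the unit ball of each group.
dual_val = 0.0
for wg, vg in zip(w, v_i):
    lam = beta * vg / wg
    if np.linalg.norm(lam) > 1.0:
        lam = lam / np.linalg.norm(lam)
    dual_val += (-0.5 / beta * np.linalg.norm(wg * lam - beta * vg) ** 2
                 + 0.5 * beta * np.linalg.norm(vg) ** 2)

print(primal_val, dual_val)        # the two values coincide (strong duality)
assert abs(primal_val - dual_val) < 1e-9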

Appendix F Proof of Theorem 3

We evaluate $g(\bm{v},\bm{\lambda},\bm{\mu};\bm{v}^{[i]})$ at $\bm{v}^{[i]}$ with $\bm{\lambda}^{[i+1]}\in\partial\|\bm{v}^{[i+1]}\|_{\mathcal{G}_{2}}$ and $\bm{\mu}^{[i+1]}\in\mathcal{N}_{\mathcal{C}}(\bm{v}^{[i+1]})$. Then, we have

\begin{aligned} &\quad\,\,g(\bm{v}^{[i]},\bm{\lambda}^{[i+1]},\bm{\mu}^{[i+1]};\bm{v}^{[i]})\\ &=G(\bm{v}^{[i]};\bm{v}^{[i]})-Q(\bm{\lambda}^{[i+1]},\bm{\mu}^{[i+1]};\bm{v}^{[i]})\\ &=\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}\|\bm{v}_{nk}^{[i]}\|_{2}+\delta_{\mathcal{C}}(\bm{v}^{[i]})-\frac{\beta}{2}\|\bm{v}^{[i]}\|_{2}^{2}\\ &\quad+\frac{1}{2\beta}\|\bm{\lambda}^{[i+1]}\odot\bm{w}^{[i]}+\bm{\mu}^{[i+1]}-\beta\bm{v}^{[i]}\|_{2}^{2}+\delta_{\mathcal{C}}^{*}(\bm{\mu}^{[i+1]})\\ &\geq\frac{1}{2\beta}\|\bm{w}^{[i]}\odot\bm{\lambda}^{[i+1]}+\bm{\mu}^{[i+1]}\|_{2}^{2}+\langle\bm{w}^{[i]},\|\bm{v}^{[i]}\|_{\mathcal{G}_{2}}-\bm{v}^{[i]}\odot\bm{\lambda}^{[i+1]}\rangle\\ &\quad+\langle\bm{\mu}^{[i+1]},\bm{v}^{[i]}-\bm{v}^{[i]}\rangle\\ &\geq\frac{1}{2\beta}\|\bm{w}^{[i]}\odot\bm{\lambda}^{[i+1]}+\bm{\mu}^{[i+1]}\|_{2}^{2},\end{aligned} (52)

where the first inequality holds due to the Fenchel-Young inequality, and the second inequality holds since $\|\bm{\lambda}_{nk}^{[i+1]}\|_{2}\leq 1$ ensures $\|\bm{v}^{[i]}\|_{\mathcal{G}_{2}}-\bm{v}^{[i]}\odot\bm{\lambda}^{[i+1]}\geq\bm{0}$.

Next, by making use of (24) to replace $\bm{\mu}^{[i+1]}$ with $\beta(\bm{v}^{[i]}-\bm{v}^{[i+1]})-\bm{w}^{[i]}\odot\bm{\lambda}^{[i+1]}$ in (52), we have

\begin{aligned} &\quad\,\,\frac{1}{2\beta}\|\bm{w}^{[i]}\odot\bm{\lambda}^{[i+1]}+\bm{\mu}^{[i+1]}\|_{2}^{2}\\ &=\frac{1}{2\beta}\|\bm{w}^{[i+1]}\odot\bm{\lambda}^{[i+1]}+\bm{\mu}^{[i+1]}+(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{\lambda}^{[i+1]}\|_{2}^{2}\\ &=\frac{1}{2\beta}\big(\|\bm{r}^{[i+1]}\|_{2}^{2}+\|(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{\lambda}^{[i+1]}\|_{2}^{2}\\ &\quad+2\langle\bm{r}^{[i+1]},(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{\lambda}^{[i+1]}\rangle\big)\\ &=\frac{1}{2\beta}\|\bm{r}^{[i+1]}\|_{2}^{2}-\frac{1}{2\beta}\|(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{\lambda}^{[i+1]}\|_{2}^{2}\\ &\quad-\langle\bm{v}^{[i+1]}-\bm{v}^{[i]},(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{\lambda}^{[i+1]}\rangle\\ &\geq\frac{1}{2\beta}\|\bm{r}^{[i+1]}\|_{2}^{2}-\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}\|\bm{w}^{[i]}-\bm{w}^{[i+1]}\|_{2}\\ &\quad-\frac{1}{2\beta}\|\bm{w}^{[i]}-\bm{w}^{[i+1]}\|_{2}^{2},\end{aligned} (53)

where we exploit the Cauchy–Bunyakovskii–Schwarz (CBS) inequality to obtain the last inequality.

On the other hand, by strong duality, we have

g(\bm{v}^{[i]},\bm{\lambda}^{[i+1]},\bm{\mu}^{[i+1]};\bm{v}^{[i]})=G(\bm{v}^{[i]};\bm{v}^{[i]})-G(\bm{v}^{[i+1]};\bm{v}^{[i]}). (54)

Combining (52), (53), and (54), we have

\begin{aligned} &\quad\,\,\|\bm{r}^{[i+1]}\|_{2}^{2}\\ &\leq 2\beta\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})+\|\bm{w}^{[i]}-\bm{w}^{[i+1]}\|_{2}^{2}\\ &\quad+2\beta\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}\|\bm{w}^{[i]}-\bm{w}^{[i+1]}\|_{2}\\ &\leq 2\beta\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})+\kappa^{2}p^{4}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}\\ &\quad+2\beta\kappa p^{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}\\ &\leq 2\beta\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})+(\kappa^{2}p^{4}+2\beta\kappa p^{2})\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}\\ &\leq 2\beta\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})+\frac{2}{\beta}(\kappa^{2}p^{4}+2\beta\kappa p^{2})\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})\\ &\leq\frac{2}{\beta}(\beta^{2}+2\beta\kappa p^{2}+\kappa^{2}p^{4})\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})\\ &\leq\frac{2}{\beta}(\beta^{2}+2\beta\kappa p^{2}+\kappa^{2}p^{4})(J(\bm{v}^{[i]})-J(\bm{v}^{[i+1]})),\end{aligned} (55)

where the fourth inequality holds due to Lemma 2 $(iii)$.

Summing both sides of the above inequality from $i=0$ to $t$, we have

\begin{aligned} C(J(\bm{v}^{[0]})-J(\bm{v}^{[t+1]}))&\geq\sum_{i=0}^{t}\|\bm{r}^{[i+1]}\|_{2}^{2}\\ &\geq t\min_{i=0,\cdots,t}\|\bm{r}^{[i+1]}\|_{2}^{2}\end{aligned} (56)

with $C=\frac{2(\beta^{2}+2\beta\kappa p^{2}+\kappa^{2}p^{4})}{\beta}$. This indicates that

\begin{aligned} \min_{i=0,\cdots,t}\|\bm{r}^{[i+1]}\|_{2}^{2}&\leq\frac{2(\beta^{2}+2\beta\kappa p^{2}+\kappa^{2}p^{4})}{t\beta}(J(\bm{v}^{[0]})-J(\bm{v}^{[t+1]}))\\ &\leq\frac{2(\beta^{2}+2\beta\kappa p^{2}+\kappa^{2}p^{4})}{t\beta}(J(\bm{v}^{[0]})-J(\bm{v}^{*})).\end{aligned} (57)

Hence

\min_{i=0,\cdots,t}\|\bm{r}^{[i+1]}\|_{2}^{2}=O\Big(\frac{1}{t}\Big). (58)

This completes the proof.
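
As a rough numerical reading of the bound (57), the snippet below evaluates the constant $C$ and the number of iterations sufficient to drive the minimal squared residual below a target tolerance; $\beta$, $\kappa$, $p$, and the initial objective gap are placeholder values only, not taken from the simulations.

# Rough illustration of the worst-case O(1/t) bound in (57): compute the
# constant C and the iteration count t that guarantees
#   min_i ||r^[i+1]||_2^2 <= eps.
# beta, kappa, p, and the initial objective gap are placeholder values only.
beta, kappa, p = 2.0, 1.0, 0.5
gap0 = 10.0        # assumed J(v^[0]) - J(v^*)
eps = 1e-3         # target squared residual

C = 2.0 * (beta**2 + 2.0 * beta * kappa * p**2 + kappa**2 * p**4) / beta
t_needed = C * gap0 / eps          # from (57): C * gap0 / t <= eps
print(f"C = {C:.3f}, iterations sufficient for eps = {eps}: t >= {t_needed:.0f}")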

References

  • [1] S. Hua, X. Yang, K. Yang, G. Yin, Y. Shi, and H. Wang, “Deep learning tasks processing in fog-RAN,” in Proc. IEEE Veh. Technol. Conf. (VTC), pp. 1–5, IEEE, 2019.
  • [2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [3] A. Voulodimos, N. Doulamis, A. Doulamis, and E. Protopapadakis, “Deep learning for computer vision: A brief review,” Comput. Intell. Neurosci., vol. 2018, pp. 1–13, 2018.
  • [4] C. V. N. Index, “Global mobile data traffic forecast update, 2017–2022,” White paper, Feb. 2019.
  • [5] Q. Zhang, L. Cheng, and R. Boutaba, “Cloud computing: state-of-the-art and research challenges,” J. Internet Services Appl., vol. 1, no. 1, pp. 7–18, 2010.
  • [6] Z. Yan and C. Chen, “Cloud-based video communication and networking: an architectural overview,” J. Commun. Information Netw., vol. 1, no. 1, pp. 27–43, 2016.
  • [7] J. Chen and X. Ran, “Deep learning with edge computing: A review,” Proc. IEEE, vol. 107, no. 8, pp. 1655–1674, 2019.
  • [8] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey on mobile edge computing: The communication perspective,” IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2322–2358, 2017.
  • [9] K. B. Letaief, W. Chen, Y. Shi, J. Zhang, and Y.-J. A. Zhang, “The roadmap to 6G: AI empowered wireless networks,” IEEE Commun. Mag., vol. 57, no. 8, pp. 84–90, 2019.
  • [10] Z. Zhang, Y. Xiao, Z. Ma, M. Xiao, Z. Ding, X. Lei, G. K. Karagiannidis, and P. Fan, “6G wireless networks: Vision, requirements, architecture, and key technologies,” IEEE Veh. Technol. Mag., vol. 14, no. 3, pp. 28–41, 2019.
  • [11] J. Zhang and K. B. Letaief, “Mobile edge intelligence and computing for the Internet of vehicles,” Proc. IEEE, vol. 108, no. 2, pp. 246–261, 2020.
  • [12] Y. Huang, C. Xu, C. Zhang, M. Hua, and Z. Zhang, “An overview of intelligent wireless communications using deep reinforcement learning,” J. Commun. Information Netw., vol. 4, no. 2, pp. 15–29, 2019.
  • [13] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022–2035, 2020.
  • [14] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, “Edge intelligence: Paving the last mile of artificial intelligence with edge computing,” Proc. IEEE, vol. 107, no. 8, pp. 1738–1762, 2019.
  • [15] T. Jiang, X. Yang, Y. Shi, and H. Wang, “Layer-wise deep neural network pruning via iteratively reweighted optimization,” in Proc. IEEE Int. Conf. Acoustics Speech Signal Process. (ICASSP), pp. 5606–5610, IEEE, 2019.
  • [16] A. Ren, T. Zhang, S. Ye, J. Li, W. Xu, X. Qian, X. Lin, and Y. Wang, “ADMM-NN: An algorithm-hardware co-design framework of DNNs using alternating direction methods of multipliers,” in Proc. ASPLOS., pp. 925–938, 2019.
  • [17] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proc. IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
  • [18] G. Plastiras, M. Terzi, C. Kyrkou, and T. Theocharides, “Edge intelligence: Challenges and opportunities of near-sensor machine learning applications,” in 2018 IEEE 29th Int. Conf. on Appl.-specific Syst. Archit. and Processors (ASAP), pp. 1–7, IEEE, 2018.
  • [19] K. Yang, Y. Shi, and Z. Ding, “Data shuffling in wireless distributed computing via low-rank optimization,” IEEE Trans. Signal Process., vol. 67, no. 12, pp. 3087–3099, 2019.
  • [20] E. Li, L. Zeng, Z. Zhou, and X. Chen, “Edge AI: On-demand accelerating deep neural network inference via edge computing,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 447–457, 2020.
  • [21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vision Pattern Recognition (CVPR), pp. 770–778, 2016.
  • [22] S. Hua, Y. Zhou, K. Yang, and Y. Shi, “Reconfigurable intelligent surface for green edge inference,” arXiv preprint arXiv:1912.00820, 2019.
  • [23] K. Yang, Y. Shi, W. Yu, and Z. Ding, “Energy-efficient processing and robust wireless cooperative transmission for edge inference,” arXiv preprint arXiv:1907.12475, 2019.
  • [24] D. Gesbert, S. Hanly, H. Huang, S. S. Shitz, O. Simeone, and W. Yu, “Multi-cell MIMO cooperative networks: A new look at interference,” IEEE J. Sel. Areas Commun., vol. 28, no. 9, pp. 1380–1408, 2010.
  • [25] K. Li, M. Tao, and Z. Chen, “Exploiting computation replication for mobile edge computing: A fundamental computation-communication tradeoff study,” arXiv preprint arXiv:1903.10837, 2019.
  • [26] A. Y. Kruger, “On fréchet subdifferentials,” J. Math. Sci., vol. 116, no. 3, pp. 3325–3358, 2003.
  • [27] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, “Optimization with sparsity-inducing penalties,” Found. Trends in Mach. Learn., vol. 4, no. 1, pp. 1–106, 2012.
  • [28] R. Jenatton, J.-Y. Audibert, and F. Bach, “Structured variable selection with sparsity-inducing norms,” J. Mach. Learn. Research, vol. 12, no. Oct, pp. 2777–2824, 2011.
  • [29] Y. Shi, J. Zhang, and K. B. Letaief, “Group sparse beamforming for green cloud-RAN,” IEEE Trans. Wireless Commun., vol. 13, no. 5, pp. 2809–2823, 2014.
  • [30] B. Dai and W. Yu, “Energy efficiency of downlink transmission strategies for cloud radio access networks,” IEEE J. Sel. Areas Commun., vol. 34, no. 4, pp. 1037–1050, 2016.
  • [31] L. Liu, E. G. Larsson, W. Yu, P. Popovski, C. Stefanovic, and E. De Carvalho, “Sparse signal processing for grant-free massive connectivity: A future paradigm for random access protocols in the Internet of things,” IEEE Signal Process. Mag., vol. 35, no. 5, pp. 88–99, 2018.
  • [32] Y. Shi, J. Dong, and J. Zhang, Low-overhead Communications in IoT Networks–Structured Signal Processing Approaches. Springer Singapore, 2020.
  • [33] O. Mehanna, N. D. Sidiropoulos, and G. B. Giannakis, “Joint multicast beamforming and antenna selection,” IEEE Trans. Signal Process., vol. 61, no. 10, pp. 2660–2674, 2013.
  • [34] E. J. Candes, M. B. Wakin, and S. P. Boyd, “Enhancing sparsity by reweighted $\ell_{1}$ minimization,” J. Fourier Anal. Appl., vol. 14, no. 5-6, pp. 877–905, 2008.
  • [35] Y. Shi, J. Cheng, J. Zhang, B. Bai, W. Chen, and K. B. Letaief, “Smoothed $\ell_{p}$-minimization for green cloud-RAN with user admission control,” IEEE J. Sel. Areas Commun., vol. 34, no. 4, pp. 1022–1036, 2016.
  • [36] P. Ochs, A. Dosovitskiy, T. Brox, and T. Pock, “On iteratively reweighted algorithms for nonsmooth nonconvex optimization in computer vision,” SIAM J. Imag. Sci., vol. 8, no. 1, pp. 331–372, 2015.
  • [37] C. Lu, Y. Wei, Z. Lin, and S. Yan, “Proximal iteratively reweighted algorithm with multiple splitting for nonconvex sparsity optimization,” in Proc. 28th AAAI Conf. Artif. Intell., pp. 1251–1257, 2014.
  • [38] T. Sun, H. Jiang, and L. Cheng, “Global convergence of proximal iteratively reweighted algorithm,” J. Global Optimiz., vol. 68, no. 4, pp. 815–826, 2017.
  • [39] P.-A. Absil, R. Mahony, and B. Andrews, “Convergence of the iterates of descent methods for analytic cost functions,” SIAM J. Optimiz., vol. 16, no. 2, pp. 531–547, 2005.
  • [40] T.-J. Yang, Y.-H. Chen, J. Emer, and V. Sze, “A method to estimate the energy consumption of deep neural networks,” in Proc. 51st Asilomar Conf. Signals, Syst. Comput., pp. 1916–1920, IEEE, 2017.
  • [41] T.-J. Yang, Y.-H. Chen, and V. Sze, “Designing energy-efficient convolutional neural networks using energy-aware pruning,” in Proc. IEEE Conf. Comput. Vision Pattern Recognition (CVPR), pp. 5687–5695, 2017.
  • [42] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vision Pattern Recognition (CVPR), pp. 1–9, 2015.
  • [43] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2016.
  • [44] “CNN Energy Estimation Website.” http://energyestimation.mit.edu.
  • [45] Y. Shen, Y. Shi, J. Zhang, and K. B. Letaief, “LORM: Learning to optimize for resource management in wireless networks with few training samples,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 665–679, 2020.
  • [46] A. Wiesel, Y. C. Eldar, and S. Shamai, “Linear precoding via conic optimization for fixed MIMO receivers,” IEEE Trans. Signal Process., vol. 54, no. 1, pp. 161–176, 2005.
  • [47] R. T. Rockafellar, Convex Analysis, vol. 36. Princeton Univ. Press, 2015.
  • [48] H. Dahrouj and W. Yu, “Coordinated beamforming for the multicell multi-antenna wireless system,” IEEE Trans. Wireless Commun., vol. 9, no. 5, pp. 1748–1759, 2010.