
Sparse Optimization for Green Edge AI Inference

Xiangyu Yang,  Sheng Hua,  Yuanming Shi*,   Hao Wang,   Jun Zhang,  and Khaled B. Letaief
X. Yang and S. Hua are with the School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China, also with the Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 200050, China, and also with the University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: {yangxy3, huasheng}@shanghaitech.edu.cn). Y. Shi and H. Wang are with the School of Information Science and Technology, ShanghaiTech University, Shanghai, China (e-mail: {shiym, wanghao1}@shanghaitech.edu.cn). J. Zhang is with the Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong (e-mail: [email protected]). K. B. Letaief is with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong (e-mail: [email protected]). He is also with Peng Cheng Laboratory, Shenzhen, China. *(Corresponding author: Yuanming Shi.) Part of this work was presented at the IEEE 90th Vehicular Technology Conference (VTC2019-Fall), Honolulu, Hawaii, USA, Sept. 2019 [1].
Abstract

With the rapid upsurge of deep learning tasks at the network edge, effective edge artificial intelligence (AI) inference becomes critical to provide low-latency intelligent services for mobile users by leveraging the edge computing capability. In such scenarios, energy efficiency becomes a primary concern. In this paper, we present a joint inference task selection and downlink beamforming strategy to achieve energy-efficient edge AI inference by minimizing the overall power consumption, consisting of both computation and transmission power consumption, which yields a mixed combinatorial optimization problem. By exploiting the inherent connection between the set of selected tasks and the group sparsity structure of the transmit beamforming vector, we reformulate the optimization as a group sparse beamforming problem. To solve this challenging problem, we propose a log-sum function based three-stage approach. By adopting the log-sum function to enhance group sparsity, a proximal iteratively reweighted algorithm is developed. Furthermore, we establish the global convergence of this algorithm and provide its ergodic worst-case convergence rate. Simulation results demonstrate the effectiveness of the proposed approach for improving energy efficiency in edge AI inference systems.

Index Terms:
AI, edge inference, cooperative transmission, energy efficiency, group sparse beamforming, proximal iteratively reweighted algorithm.

I Introduction

THE availability of big data and computing power, along with advances in optimization algorithms, has triggered a booming era of artificial intelligence (AI). Notably, deep learning [2] is regarded as the most popular branch of modern AI and has achieved exciting breakthroughs in applications such as speech recognition and computer vision [3]. Benefiting from these achievements, AI is becoming a promising tool that streamlines people’s decision-making processes and facilitates the development of diversified intelligent services (e.g., virtual personal assistants, recommendation systems, etc.). Meanwhile, with the proliferation of mobile computing as well as Internet-of-Things (IoT) devices, massive real-time data are generated locally [4]. However, it is widely acknowledged that traditional cloud-based computing [5, 6] faces challenges (e.g., latency, privacy and network congestion) in supporting ubiquitous AI-empowered applications on mobile devices [7].

In contrast, edge AI is a promising approach that can tackle the above concerns by fusing mobile edge computing [8] with AI-enabled techniques (e.g., deep neural networks (DNNs)). By pushing AI models to the network edge, it brings the edge servers close to the requesting mobile devices and thus enables low-latency and privacy-preserving services. Notably, edge AI is envisioned as a key ingredient of future intelligent 6G networks [9, 10, 11, 12], which fully unleashes the potential of mobile communication and computation. Typically, edge AI consists of two phases: edge training and edge inference. In the edge training phase, AI models can be trained on the cloud, at the edge, or on end devices [13]; however, training is beyond the scope of this work. This paper mainly focuses on edge inference, i.e., deploying trained AI models and performing model inference at the network edge. Following [7, 14], the edge AI inference architecture is generally classified into three major types:

  • On-device inference: It performs the model inference directly on end devices where DNN models are deployed. While some enabling techniques (e.g., model compression [15, 16], hardware speedup [17]) have been proposed to facilitate the deployment of the DNN model, it still poses challenges for resource-limited (e.g., memory, power budget and computation) end devices [18]. To mitigate such concerns, on-device distributed computing is envisioned as a promising solution for on-device inference, which enables AI model inference across multiple distributed end devices [19].

  • Joint device-edge inference: This mode carries out the AI model inference in a device-edge cooperation fashion [7] with the model partition and model early-exit techniques [20]. While device-edge cooperation is flexible and enables low-latency edge inference, it may still place high resource requirements on end devices due to the resource-demanding nature of DNNs [21].

  • Edge server inference: Such methods transfer the raw input data to edge servers for processing, which then return the inference results to end users [22, 23]. Edge server inference is particularly suitable for computation-intensive tasks. Nonetheless, the inference performance relies mainly on the channel bandwidth between the edge server and the end devices. Cooperative transmission [24] thus becomes promising for communication-efficient delivery of inference results.

To support computation-intensive tasks for resource-limited end devices, edge server inference stands out as a viable solution to fulfill the key performance requirements. The main focus of this paper is on AI model inference for mobile devices with the edge server inference architecture. For the edge AI inference system, energy efficiency is a key performance indicator [14], which motivates us to focus on energy-efficient edge inference design. This is achieved by optimizing the overall network power consumption, including the computation power consumption for performing inference tasks and the transmission power consumption for returning inference results. In particular, cooperative transmission [24] is a widely recognized technique to reduce the downlink transmit power consumption and provide low-latency transmission services by exploiting high beamforming gains for edge AI inference. In this work, we thus consider that multiple edge base stations (BSs) collaboratively transmit the inference results to the end devices [22]. To enable transmission cooperation, we apply the computation replication principle [25], i.e., the inference tasks from end devices can be performed by several neighboring edge BSs to create multiple copies of the inference results. However, computation replication greatly increases the power consumption of performing inference tasks. Therefore, it is necessary to select the inference tasks to be performed by each edge BS to achieve an optimal balance between communication and computation power consumption.

In this paper, we propose a joint inference task selection and downlink beamforming strategy towards achieving energy-efficient edge AI inference by optimizing the overall network power consumption, consisting of the computation power consumption and the transmission power consumption, under quality-of-service (QoS) constraints. However, the resulting formulation contains combinatorial variables and nonconvex constraints, which makes it computationally intractable. To address this issue, we observe that the transmit beamforming vector has an intrinsic connection with the set of selected inference tasks (i.e., the tasks each edge server opts to execute). Based on this crucial observation, we present a group sparse beamforming (GSBF) reformulation, followed by a log-sum function based three-stage GSBF approach. In particular, in the first stage, we adopt a weighted log-sum function based relaxation to enhance the group sparsity of the structural solutions.

Nonetheless, the log-sum function minimization problem poses challenges in computation and analysis. To resolve these issues, we present a proximal iteratively reweighted algorithm, which solves a sequence of weighted convex subproblems. Moreover, we establish the global convergence and worst-case convergence rate of the presented proximal iteratively reweighted algorithm. Specifically, by leveraging the Fréchet subdifferential [26], we characterize the first-order necessary optimality conditions of the formulated convex-constrained log-sum problem. We then show that the iterates generated by the proposed algorithm steadily decrease the objective value and prove that, for any initial feasible point, any cluster point of the generated sequence is a critical point of the original objective. Finally, we show that the defined optimality residual has an O(1/t) ergodic worst-case convergence rate, where t is the iteration counter.

The major contributions of this paper are summarized as follows.

  • 1.

    We propose a joint task selection and downlink beamforming strategy to optimize the trade-off between computation and communication power consumption for an energy-efficient edge AI inference system. In particular, task selection is achieved by controlling the group sparsity structure of the transmit beamforming vector, thereby formulating a group sparse beamforming problem under the target QoS constraints.

  • 2.

    To solve the resulting optimization problem, we propose a log-sum function based three-stage GSBF approach. In particular, we adopt a weighted log-sum approximation to enhance the group sparsity of the transmit beamforming vector in the first stage. Moreover, we propose a proximal iteratively reweighted algorithm to solve the log-sum minimization problem.

  • 3.

    For the presented proximal iteratively reweighted algorithm, we establish a global convergence analysis. We prove that every cluster point of the sequence generated by the presented algorithm satisfies the first-order necessary optimality condition of the original nonconvex log-sum problem. Furthermore, a worst-case O(1/t) convergence rate is established for this algorithm in an ergodic sense.

  • 4.

    Numerical experiments are conducted to demonstrate the effectiveness and competitive performance of the log-sum function based three-stage GSBF approach for designing the green edge AI inference system.

I-A Related Works

The study of inducing sparsity generally falls into the category of sparse optimization [27, 28]. In particular, sparse optimization, emerging as a powerful tool, has recently contributed to the effective design of wireless networks, e.g., group sparse beamforming for energy-efficient cloud radio access networks [29, 30], and sparse signal processing for Internet-of-Things (IoT) networks [31, 32]. To induce the group sparsity structure of the beamforming vector, the works [22, 23] adopted the mixed \ell_{1,2}-norm. As illustrated in [27], the mixed \ell_{1,q}-norms (q>1) can induce the group sparsity structure of the solution of interest; among them, the mixed \ell_{1,2}-norm and the \ell_{1,\infty}-norm [33] are commonly adopted. However, the sparsity induced by convex sparsity-inducing norms is often unsatisfactory, since there always exist some small nonzero elements in the obtained solutions [34]. In contrast, some works applied nonconvex sparsity-inducing functions to seek sparser solutions [35]. Notably, the work [34] reported the capability of the log-sum function to enhance the sparsity of the solutions.

Motivated by their superior performance in inducing sparsity, we adopt the log-sum function to promote the sparsity pattern of the solutions. However, adopting the log-sum function to enhance sparsity usually makes the problem difficult to compute and analyze. In [34], the authors first proposed an iteratively reweighted \ell_{1} algorithm (IRL1) for tackling the nonconvex and nonsmooth log-sum function with linear constraints. Nonetheless, they did not conduct a convergence analysis for the proposed method. Under reasonable assumptions, the work [36] established convergence results for a class of unconstrained nonconvex nonsmooth problems based on the limiting-subgradient tool; in particular, these results apply to the log-sum model in an unconstrained setting. The work [37] proposed a proximal iteratively reweighted algorithm and proved that any accumulation point is a critical point. The work [38] further showed that, for any feasible starting point, the sequence generated by their proximal iteratively reweighted algorithm converges to a critical point under the Kurdyka-Łojasiewicz (KL) property [39]. However, these works focused on unconstrained formulations or linearly constrained cases when the log-sum model is involved. The theoretical analysis of the log-sum function with general convex-set constraints has not been investigated.

I-B Organization

The remainder of this paper is organized as follows. Section II presents the system model of edge AI inference, followed by the problem formulation and analysis. Section II-C provides the group sparse beamforming formulation. The log-sum function based three-stage GSBF approach is proposed in Section III. Section IV provides the global convergence and convergence rate analysis of the proposed proximal iteratively reweighted algorithm. Section V demonstrates the performance of the proposed approach. Concluding remarks are made in Section VI. To keep the main text coherent and free of technical details, we defer most of the mathematical proofs to the appendices.

I-C Notation

Throughout this paper, we use the following notation. \mathbb{C}^{n} and \mathbb{R}^{n} denote the complex and real Euclidean n-spaces, respectively. Boldface lower-case and upper-case letters represent vectors (e.g., \bm{x}) and matrices (e.g., \bm{I}) of appropriate size, respectively. The inner product between \bm{x},\bm{y}\in\mathbb{C}^{n} is denoted as \langle\bm{x},\bm{y}\rangle. \|\cdot\|_{1} and \|\cdot\|_{2} are the conventional \ell_{1}-norm and \ell_{2}-norm of vectors in \mathbb{C}^{n}, respectively. In addition, (\cdot)^{\sf{H}} and (\cdot)^{\sf{T}} denote the Hermitian and transpose operators, respectively, and \mathfrak{R}(\cdot) is the real part of a complex scalar. \bm{1} is the all-ones vector and \bm{0} denotes the zero vector of appropriate size. In particular, \|\bm{v}\|_{\mathcal{G}_{2}}=\left[\|\bm{v}_{1}\|_{2},\cdots,\|\bm{v}_{n}\|_{2}\right]^{\sf{T}}\in\mathbb{R}^{n} represents the vector whose nth element is the \ell_{2}-norm of the subvector \bm{v}_{n}. We use \circ to denote the composition of two functions, and \odot denotes the elementwise product between two vectors \bm{x},\bm{y}\in\mathbb{C}^{n}.

For any closed convex set 𝒞n\mathcal{C}\subset\mathbb{C}^{n}, we use δ𝒞(𝒄)\delta_{\mathcal{C}}(\bm{c}) to denote the characteristic function associated with 𝒞\mathcal{C}, which is defined as

δ𝒞(𝒄):={0,𝒄𝒞,+,𝒄𝒞.\delta_{\mathcal{C}}(\bm{c}):=\left\{\begin{array}[]{lc}0,&\bm{c}\in\mathcal{C},\\ +\infty,&\bm{c}\notin\mathcal{C}.\end{array}\right.

Similarly, \mathbb{I}(\cdot) denotes an indicator function associated with a given condition, i.e., it returns the value 1 if the condition is met and the value 0 otherwise. Moreover, z\sim\mathcal{CN}(\mu,\sigma^{2}) denotes a complex Gaussian random variable z with mean \mu and variance \sigma^{2}.

II System Model and Problem Formulation

This section describes the overall system model and power consumption model for performing intelligent tasks in the considered edge AI inference system, followed by the problem formulation and analysis.

II-A System Model

We consider an edge computing system consisting of N L-antenna BSs collaboratively serving K single-antenna mobile users (MUs), as illustrated in Fig. 1. The deployed BSs serve as dedicated edge nodes with access to abundant computation and storage resources [8]. For convenience, define \mathcal{K}=\{1,\dots,K\} and \mathcal{N}=\{1,\dots,N\} as the index sets of MUs and BSs, respectively. Each MU has an inference task whose result can be inferred from a task-related DNN. For ease of exposition, we use \bm{d}_{k} to denote the raw input data collected from MU k, and the corresponding inference result is represented as \phi_{k}(\bm{d}_{k}). As performing intelligent tasks on DNNs is typically resource-demanding, it is usually impractical to perform the tasks locally on resource-constrained mobile devices. In the proposed edge AI inference system, by exploiting computation replication [25], we consider the scenario where each neighboring edge BS has collected the raw input data \{\bm{d}_{k}\} from all MUs. The edge BSs then process the data \{\bm{d}_{k}\} for model inference. After the edge BSs complete the model inference, the inference results \{\phi_{k}(\bm{d}_{k})\} are returned to the corresponding MUs via the downlink channels. We assume that all edge BSs are equipped with the pre-trained deep network models for all inference tasks [23].

In the downlink transmission, the edge BSs that perform the inference task of the same MU cooperatively return the inference result to that MU. We assume perfect channel state information (CSI) is available at all edge BSs to enable cooperative transmission of the inference results [24]. Let \mathcal{A}_{n}\subseteq\mathcal{K} denote the index set of MUs whose tasks are selected to be performed by BS n, and let \mathcal{A}=(\mathcal{A}_{1},\cdots,\mathcal{A}_{N}) represent the task selection strategy.

Figure 1: System model illustration of the edge AI inference for intelligent tasks. This paper considers the scenario that each neighboring edge BS has collected the raw input data {𝒅k}\{\bm{d}_{k}\} from mobile users.

II-A1 Downlink Transmission Model

Let s_{k}\in\mathbb{C} denote the encoded scalar of the requested output \phi_{k}(\bm{d}_{k}) for MU k, and let \bm{v}_{nk}\in\mathbb{C}^{L} be the transmit beamforming vector at BS n for s_{k}. For convenience, and without loss of generality, we assume that \mathbb{E}(|s_{k}|^{2})=1, i.e., the power of s_{k} is normalized to unity. The transmitted signal \bm{x}_{n} at BS n can be expressed as

𝒙n=k𝒜n𝒗nksk.\bm{x}_{n}=\sum_{k\in\mathcal{A}_{n}}\bm{v}_{nk}s_{k}. (1)

Let \bm{h}_{nk}\in\mathbb{C}^{L} be the propagation channel coefficient vector between BS n and MU k. The received signal at MU k, denoted as y_{k}, is then given by

yk\displaystyle y_{k} =n𝒩𝒉nk𝖧𝒙n+zk\displaystyle=\sum_{n\in\mathcal{N}}\bm{h}_{nk}^{\sf{H}}\bm{x}_{n}+z_{k} (2)
=n𝒩𝒉nk𝖧l𝒜n𝒗nlsl+zk\displaystyle=\sum_{n\in\mathcal{N}}\bm{h}_{nk}^{\sf{H}}\sum_{l\in\mathcal{A}_{n}}\bm{v}_{nl}s_{l}+z_{k}
=n𝒩𝒉nk𝖧[𝕀(k𝒜n)𝒗nksk+l𝒜n,lk𝒗nlsl]+zk,\displaystyle=\sum_{n\in\mathcal{N}}\bm{h}_{nk}^{\sf{H}}\!\left[\mathbb{I}(k\in\mathcal{A}_{n})\bm{v}_{nk}s_{k}\!+\!\sum_{l\in\mathcal{A}_{n},l\neq k}\!\bm{v}_{nl}s_{l}\right]+z_{k},

where zk𝒞𝒩(0,σk2)z_{k}\sim\mathcal{CN}(0,\sigma^{2}_{k}) is the isotropic additive white Gaussian noise.

We assume that all data symbols s_{k} are mutually independent of each other and of the noise. Based on (2), the signal-to-interference-plus-noise ratio (SINR) for MU k is therefore given as

SINRk(𝒜)=|n𝒩𝕀(k𝒜n)𝒉nk𝖧𝒗nk|2lk|n𝒩𝕀(l𝒜n)𝒉nk𝖧𝒗nl|2+σk2.\textrm{SINR}_{k}(\mathcal{A})=\frac{|\sum_{n\in\mathcal{N}}\mathbb{I}(k\in\mathcal{A}_{n})\bm{h}_{nk}^{\sf{H}}\bm{v}_{nk}|^{2}}{\sum_{l\neq k}|\sum_{n\in\mathcal{N}}\mathbb{I}(l\in\mathcal{A}_{n})\bm{h}_{nk}^{\sf{H}}\bm{v}_{nl}|^{2}+\sigma_{k}^{2}}. (3)

II-A2 Power Consumption Model

The computation and transmission power consumption for model inference is generally large. Energy efficiency is therefore of significant importance in the edge AI inference system design, and the overall network power consumed in computation and communication at the edge BSs becomes our main interest. Specifically, we express the total downlink transmission power of all edge BSs as

Ptrans(𝒜,{𝒗nk}):\displaystyle P_{\textrm{trans}}(\mathcal{A},\{\bm{v}_{nk}\}): =n=1N1ηn𝔼[k𝒜n𝒗nksk22]\displaystyle=\sum_{n=1}^{N}\frac{1}{\eta_{n}}\mathbb{E}\bigg{[}\sum_{k\in\mathcal{A}_{n}}\|\bm{v}_{nk}s_{k}\|_{2}^{2}\bigg{]} (4)
=n=1Nk𝒜n1ηn𝒗nk22,\displaystyle=\sum_{n=1}^{N}\sum_{k\in\mathcal{A}_{n}}\frac{1}{\eta_{n}}\|\bm{v}_{nk}\|_{2}^{2},

where ηn\eta_{n} is the radio frequency power amplifier efficiency coefficient of edge BS nn.

In addition to the downlink transmission power consumption, the power consumed in performing AI inference tasks should be taken into consideration as well, owing to the power-demanding nature of running DNNs. We use P_{nk}^{\sf{c}} to denote the computation power consumption of BS n in performing inference task \phi_{k}. Then the computation power consumed by all BSs is given by

Pcomp(𝒜):=n𝒩k𝒜nPnk𝖼.P_{\textrm{comp}}(\mathcal{A}):=\sum_{n\in\mathcal{N}}\sum_{k\in\mathcal{A}_{n}}P_{nk}^{\sf{c}}. (5)

To estimate the computation energy consumed in executing task \phi_{k}, the works [40, 41] stated that the inference energy consumption of a deep neural network layer mainly includes computation energy consumption and data movement energy consumption. We take GoogLeNet v1 [42] as a concrete example of the energy consumed by performing inference tasks. Specifically, we use GoogLeNet v1 to perform image classification tasks on the Eyeriss chip [43]. With the help of an energy estimation online tool [44], we are able to visualize the energy consumption breakdown of GoogLeNet v1, as illustrated in Fig. 2. We obtain the estimate of the computation power consumption by dividing the total energy consumption by the computation time. In particular, the computation time is determined by the total number of multiplication-and-accumulation (MAC) operations and the peak throughput of the Eyeriss chip.
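As a quick illustration of this estimate, a minimal sketch is given below; the energy, MAC-count, and throughput numbers are hypothetical placeholders rather than the measured values from [43, 44].

```python
# Hypothetical illustration of estimating P_nk^c as (inference energy) / (computation time),
# where computation time = (total MACs) / (peak MAC throughput). All numbers are placeholders.
total_energy_J = 1.0e-2        # estimated energy of one inference pass (joules)
total_macs = 1.4e9             # total multiplication-and-accumulation operations of the model
peak_throughput = 6.0e10       # peak MAC throughput of the accelerator (MACs per second)

compute_time_s = total_macs / peak_throughput       # inference latency on the chip
P_comp_watts = total_energy_J / compute_time_s      # estimated computation power P_nk^c
print(f"computation time ~ {compute_time_s * 1e3:.1f} ms, P^c ~ {P_comp_watts:.2f} W")
```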

Figure 2: The estimated energy consumption breakdown [44] of the GoogLeNet v1 to perform image classification tasks on the Eyeriss chip [43].

Therefore, the overall power consumption for edge AI inference, including transmission and computation power consumption, is calculated as

Poverall(𝒜,{𝒗nk})\displaystyle P_{\textrm{overall}}(\mathcal{A},\{\bm{v}_{nk}\}) =Ptrans(𝒜,{𝒗nk})+Pcomp(𝒜)\displaystyle\!=\!P_{\textrm{trans}}(\mathcal{A},\{\bm{v}_{nk}\})+P_{\textrm{comp}}(\mathcal{A}) (6)
=n𝒩k𝒜n1ηn𝒗nk22+n𝒩k𝒜nPnk𝖼.\displaystyle=\!\sum_{n\in\mathcal{N}}\!\sum_{k\in\mathcal{A}_{n}}\!\frac{1}{\eta_{n}}\|\bm{v}_{nk}\|_{2}^{2}\!+\!\!\sum_{n\in\mathcal{N}}\!\sum_{{k\in\mathcal{A}_{n}}}\!P_{nk}^{\sf{c}}.

II-B Problem Formulation and Analysis

Note that there is a fundamental trade-off between transmission and computation power consumption. To be specific, more edge BSs performing the same task for MUs can significantly reduce the transmission power by exploiting higher transmit beamforming gains. However, this inevitably increases the computation power consumption for performing inference tasks. Therefore, the goal of an energy-efficient edge inference system can be achieved by minimizing the overall network power consumption to reach a balance between these two parts of power consumption.

Let \{\gamma_{k}\} be the target SINRs for the MUs to reliably receive the AI inference results in the downlink. In our proposed energy-efficient edge AI inference system, the overall power minimization problem is thus formulated as

min𝒜,{𝒗nk}\displaystyle\min_{\begin{subarray}{c}\mathcal{A},\{\bm{v}_{nk}\}\end{subarray}}{} Poverall(𝒜,{𝒗nk})\displaystyle P_{\textrm{overall}}(\mathcal{A},\{\bm{v}_{nk}\}) (7)
s.t. SINRk(𝒜)γk,k𝒦,\displaystyle\text{SINR}_{k}(\mathcal{A})\geq\gamma_{k},~{}\forall\,k\in\mathcal{K},
k𝒦𝒗nk22Pn𝗆𝖺𝗑,n𝒩,\displaystyle\sum_{k\in\mathcal{K}}\|\bm{v}_{nk}\|_{2}^{2}\leq P_{n}^{\sf{max}},~{}\forall\,n\in\mathcal{N},

where Pn𝗆𝖺𝗑>0P_{n}^{\sf{max}}>0 denotes the maximum transmit power of edge BS nn.

Unfortunately, problem (7) turns out to be a mixed combinatorial optimization problem due to the presence of combinatorial variable 𝒜\mathcal{A}, which makes it computationally intractable [45]. On the other hand, the nonconvex SINR constraints also pose troublesome challenges for solving (7). To address these issues, we recast problem (7) into a tractable formulation by inducing the group sparsity of the beamforming vector in the following section.

II-C A Group Sparse Beamforming Representation Framework

One naive approach to cope with the combinatorial variable \mathcal{A} is exhaustive search. However, it is often computationally prohibitive owing to its exponential complexity. As a practical alternative, a critical observation is that the combinatorial variable \mathcal{A} can be eliminated by exploiting the inherent connection between task selection and the group sparsity structure of the beamforming vectors. Specifically, if edge BS n does not perform the inference task from MU k (i.e., k\notin\mathcal{A}_{n}), then it will not deliver the inference result \phi_{k}(\bm{d}_{k}) in the downlink transmission (i.e., \bm{v}_{nk}=\bm{0}). In other words, if k\notin\mathcal{A}_{n}, all coefficients in the beamforming vector \bm{v}_{nk} are zero simultaneously. Mathematically, we have \mathcal{A}_{n}=\{k\mid\bm{v}_{nk}\neq\bm{0},~k\in\mathcal{K}\} for all n\in\mathcal{N}, meaning that the task selection strategy \mathcal{A} is uniquely determined by the group sparsity structure of \{\bm{v}_{nk}\}. In this respect, the overall network power consumption in problem (7) can be rewritten as

Psparse({𝒗nk})=n=1Nk=1K1ηn𝒗nk22+n=1Nk=1K𝕀(𝒗nk𝟎)Pnk𝖼.P_{\textrm{sparse}}(\{\bm{v}_{nk}\})\!=\!\sum_{n=1}^{N}\!\sum_{k=1}^{K}\!\frac{1}{\eta_{n}}\|\bm{v}_{nk}\|_{2}^{2}\!+\!\sum_{n=1}^{N}\!\sum_{k=1}^{K}\mathbb{I}(\bm{v}_{nk}\neq\bm{0})P_{nk}^{\sf{c}}. (8)

By considering the sparsity structure in the beamforming vectors, the SINR expression (3) is transformed into

SINRk\displaystyle\text{SINR}_{k} =|n𝒩𝒉nk𝖧𝒗nk|2lk|n𝒩𝒉nk𝖧𝒗nl|2+σk2\displaystyle=\frac{|\sum_{n\in\mathcal{N}}\bm{h}_{nk}^{\sf{H}}\bm{v}_{nk}|^{2}}{\sum_{l\neq k}|\sum_{n\in\mathcal{N}}\bm{h}_{nk}^{\sf{H}}\bm{v}_{nl}|^{2}+\sigma^{2}_{k}} (9)
=|𝒉k𝖧𝒗k|2lk|𝒉k𝖧𝒗l|2+σk2,k𝒦,\displaystyle=\frac{|\bm{h}_{k}^{\sf{H}}\bm{v}_{k}|^{2}}{\sum_{l\neq k}|\bm{h}_{k}^{\sf{H}}\bm{v}_{l}|^{2}+\sigma^{2}_{k}},~{}\,\forall\,k\in\mathcal{K},

where 𝒉k=[𝒉1k𝖧,,𝒉Nk𝖧]𝖧\bm{h}_{k}=\left[\bm{h}_{1k}^{\sf{H}},\dots,\bm{h}_{Nk}^{\sf{H}}\right]^{\sf{H}} and 𝒗k=[𝒗1k𝖳,,𝒗Nk𝖳]𝖳\bm{v}_{k}\!=\!\left[\bm{v}_{1k}^{\sf{T}},\dots,\bm{v}_{Nk}^{\sf{T}}\right]^{\sf{T}} are the aggregated channel vector and downlink transmit beamforming vector for MU kk, respectively.
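To make the correspondence concrete, the following minimal numpy sketch recovers the task selection \mathcal{A} from the group-sparsity pattern of \{\bm{v}_{nk}\} and evaluates (8) and (9); the dimensions, values, and the (NL)\times K array layout are arbitrary illustrative assumptions, not prescribed by the paper.

```python
import numpy as np

N, K, L = 3, 2, 2                 # numbers of BSs, MUs, and antennas per BS (toy sizes)
rng = np.random.default_rng(0)
H = rng.standard_normal((N * L, K)) + 1j * rng.standard_normal((N * L, K))  # columns are h_k
V = rng.standard_normal((N * L, K)) + 1j * rng.standard_normal((N * L, K))  # columns are v_k
V[0:L, 1] = 0                     # e.g., BS 1 does not serve MU 2, so the whole group is zero
eta = np.full(N, 0.25)            # power amplifier efficiencies eta_n
P_c = np.full((N, K), 0.45)       # computation power P_nk^c (watts)
sigma2 = np.ones(K)               # noise variances sigma_k^2

# Task selection A_n = {k : ||v_nk||_2 != 0}, recovered from the group-sparsity pattern
group_norms = np.array([[np.linalg.norm(V[n * L:(n + 1) * L, k]) for k in range(K)]
                        for n in range(N)])
A = [np.flatnonzero(group_norms[n] > 1e-9) for n in range(N)]

# Objective (8): per-group transmit power plus computation power of the selected tasks
P_sparse = np.sum(group_norms ** 2 / eta[:, None]) + np.sum(P_c[group_norms > 1e-9])

# SINR (9) with aggregated channels h_k and aggregated beamformers v_k
sinr = np.array([np.abs(H[:, k].conj() @ V[:, k]) ** 2 /
                 (sum(np.abs(H[:, k].conj() @ V[:, l]) ** 2 for l in range(K) if l != k)
                  + sigma2[k]) for k in range(K)])
print(A, P_sparse, sinr)
```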

On the other hand, since an arbitrary phase rotation of the transmit beamforming vectors {𝒗nk}\{\bm{v}_{nk}\} does not affect the downlink SINR constraints and the objective function value, we can always find proper phases to equivalently transform the SINR constraints in (7) into convex second-order cone constraints [46]. We thus have the following convex-constrained sparse optimization framework for network power minimization

min{𝒗nk}\displaystyle\min_{\{\bm{v}_{nk}\}}{} Psparse({𝒗nk})\displaystyle P_{\textrm{sparse}}(\{\bm{v}_{nk}\}) (10)
s.t. k𝒦𝒗nk22Pn𝗆𝖺𝗑,n𝒩,\displaystyle\sum_{k\in\mathcal{K}}\|\bm{v}_{nk}\|_{2}^{2}\leq P_{n}^{\sf{max}},~{}\forall n\in\mathcal{N},
lk|𝒉k𝖧𝒗l|2+σk21γk(𝒉k𝖧𝒗k),k𝒦.\displaystyle\sqrt{\sum_{l\neq k}\!|\bm{h}_{k}^{\sf{H}}\bm{v}_{l}|^{2}+\!\sigma^{2}_{k}}\!\leq\frac{1}{\sqrt{\gamma_{k}}}\mathfrak{R}(\bm{h}_{k}^{\sf{H}}\bm{v}_{k}),~{}\forall k\in\mathcal{K}.

However, problem (10) is still nonconvex due to the indicator function in the objective. As presented in [29, Proposition 1], a weighted mixed \ell_{1,2}-norm can serve as the tightest convex surrogate of the objective in (10), i.e.,

P({𝒗nk})=2n=1Nk=1KPnk𝖼/ηn𝒗nk2.P(\{\bm{v}_{nk}\})=2\sum_{n=1}^{N}\sum_{k=1}^{K}\sqrt{P_{nk}^{\sf{c}}/\eta_{n}}\|\bm{v}_{nk}\|_{2}. (11)

In this paper, we instead propose to adopt a new group-sparsity-inducing function for inference task selection that enhances sparsity, thereby further reducing the network power consumption.

III A Log-sum Function Based Three-stage Group Sparse Beamforming Framework

In this section, we shall propose to adopt the log-sum function to enhance the group sparsity of the beamforming vector, followed by describing the log-sum function based three-stage GSBF approach. In particular, we propose a proximal iteratively reweighted algorithm to address the log-sum minimization problem in the first stage.

III-A Log-sum Function for Enhancing Group Sparsity

Let 𝒗=[𝒗11𝖳,,𝒗1K𝖳,,𝒗N1𝖳,,𝒗NK𝖳]𝖳LNK\bm{v}=\!\left[\bm{v}_{11}^{\sf{T}},\cdots,\bm{v}_{1K}^{\sf{T}},\cdots,\bm{v}_{N1}^{\sf{\sf{T}}},\cdots,\bm{v}_{NK}^{\sf{T}}\right]^{\sf{T}}\!\in\!\mathbb{C}^{LNK} denote the aggregated beamforming vector 𝒗\bm{v}. To promote the group sparsity for the beamforming vector 𝒗nk\bm{v}_{nk}, in this paper, we propose to use the following weighted nonconvex log-sum function as an approximation for the objective Psparse(𝒗)P_{\textrm{sparse}}(\bm{v})

Ω(𝒗):=n=1Nk=1Kρnklog(1+p𝒗nk2),\Omega(\bm{v}):=\sum_{n=1}^{N}\sum_{k=1}^{K}\rho_{nk}\log(1+p\|\bm{v}_{nk}\|_{2}), (12)

where ρnk=Pnk𝖼/ηn>0\rho_{nk}=\sqrt{P^{\sf{c}}_{nk}/\eta_{n}}>0 is a weight coefficient and p>0p>0 is a tunable parameter. The main motivation for adopting such a log-sum penalty among various types of sparsity-inducing functions [27] is based on the following considerations:

  • The mixed \ell_{1,2}-norm acts like an \ell_{1}-norm on the group norms of \bm{v} and therefore offers the tightest convex relaxation of the \ell_{0}-norm. In contrast to the mixed \ell_{1,2}-norm, it has been reported that the log-sum function can induce significantly sparser solutions than the conventional \ell_{1}-norm [27, 34].

  • From the perspective of algorithm design and theoretical analysis, the log-sum function brings more practicability owing to its coercivity and the boundedness of its first derivative.
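As a small self-contained sketch of how the weighted log-sum penalty (12) is evaluated on a group-structured beamformer (the (NL)\times K array layout of \bm{v} is an implementation assumption):

```python
import numpy as np

def log_sum_penalty(V, rho, p, L):
    """Weighted log-sum penalty (12): sum_{n,k} rho_nk * log(1 + p * ||v_nk||_2).

    V  : complex array of shape (N*L, K); rows n*L:(n+1)*L hold v_nk for BS n.
    rho: array of shape (N, K) with weights rho_nk = sqrt(P_nk^c / eta_n).
    """
    N, K = rho.shape
    return sum(rho[n, k] * np.log1p(p * np.linalg.norm(V[n * L:(n + 1) * L, k]))
               for n in range(N) for k in range(K))
```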

III-B A Log-sum Function Based Three-stage Group Sparse Beamforming Approach

We now present the proposed log-sum based three-stage GSBF framework. Specifically, the first stage solves the log-sum convex-constrained problem via the proposed proximal iteratively reweighted algorithm to obtain a solution \bm{v}_{\textrm{sparse}}; the second stage prioritizes the tasks based on the obtained solution \bm{v}_{\textrm{sparse}} and the system parameters, and then obtains the optimal task selection strategy \mathcal{A}^{*}; with \mathcal{A}^{*} fixed, the third stage refines \bm{v}. The details are as follows.

Stage 1: Log-sum Function Minimization. In this first stage, we obtain the group sparsity structure of beamformer 𝒗\bm{v} by solving the following nonconvex program

min𝒗Ω(𝒗)s.t.𝒗𝒞,\displaystyle\begin{aligned} \min_{\bm{v}}~{}\Omega(\bm{v})\quad\textrm{s.t.}\quad\bm{v}\in\mathcal{C},\end{aligned} (13)

where 𝒞\mathcal{C} denotes the convex-constraints in (10).

However, the nonconvex and nonsmooth objective in (13) and the presence of the convex constraints usually pose challenges in computation and analysis. Inspired by the work of [34], we can iteratively minimize the objective by solving a sequence of tractable convex subproblems. The main idea of our presented algorithm is to solve a well-constructed convex surrogate subproblem instead of directly solving the original nonconvex problem.

Let f_{p}(\bm{v}_{nk}):=\log(1+p\|\bm{v}_{nk}\|_{2}). First observe that f_{p}(\bm{v}_{nk}) is a composite function with \bm{z}(\bm{v}_{nk})=\|{\bm{v}}_{nk}\|_{2} convex and f_{p}(\bm{z})=\log(1+pz_{nk}) concave (hence nonconvex). At the i-th iterate \bm{v}_{nk}^{[i]}, for any feasible \tilde{\bm{v}}_{nk}, we have

fp(𝒛(𝒗~nk))\displaystyle f_{p}(\bm{z}(\tilde{\bm{v}}_{nk})) fp(𝒛(𝒗nk[i]))+𝒘[i],𝒛(𝒗~nk)𝒛(𝒗nk[i])\displaystyle\leq f_{p}(\bm{z}(\bm{v}_{nk}^{[i]}))+\langle\bm{w}^{[i]},\bm{z}(\tilde{\bm{v}}_{nk})-\bm{z}(\bm{v}_{nk}^{[i]})\rangle (14)
fp(𝒛(𝒗nk[i]))+𝒘[i],𝒛(𝒗~nk)𝒛(𝒗nk[i])\displaystyle\leq f_{p}(\bm{z}(\bm{v}_{nk}^{[i]}))+\langle\bm{w}^{[i]},\bm{z}(\tilde{\bm{v}}_{nk})-\bm{z}(\bm{v}_{nk}^{[i]})\rangle
+β2𝒗~nk𝒗nk[i]22,\displaystyle\quad+\frac{\beta}{2}\|\tilde{\bm{v}}_{nk}-\bm{v}_{nk}^{[i]}\|_{2}^{2},

where \bm{w}^{[i]}\in\partial(f_{p}(\bm{z}(\bm{v}_{nk}^{[i]}))) is the subgradient of f_{p}(\bm{z}) at \bm{z}=\bm{z}(\bm{v}_{nk}^{[i]}), \beta>0 is a prescribed proximity parameter, and the first inequality holds by the concavity of f_{p}(\cdot) (cf. (31) in Appendix B). Hence, a convex subproblem is derived as an approximation of f_{p}(\bm{v}_{nk}) at the current iterate \bm{v}_{nk}^{[i]}, which reads

min{𝒗nk}\displaystyle\min_{\{\bm{v}_{nk}\}} n=1Nk=1Kwnk[i]𝒗nk2+β2n=1Nk=1K𝒗nk𝒗nk[i]22\displaystyle\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}\|\bm{v}_{nk}\|_{2}+\frac{\beta}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\|\bm{v}_{nk}-\bm{v}_{nk}^{[i]}\|_{2}^{2} (15)
s.t. 𝒗𝒞\displaystyle\bm{v}\in\mathcal{C}

with weights

wnk[i]=ρnk(fp(𝒛(𝒗nk[i])))=pρnkp𝒗nk[i]2+1.w_{nk}^{[i]}=\rho_{nk}\cdot\partial(f_{p}(\bm{z}(\bm{v}_{nk}^{[i]})))=\frac{p\rho_{nk}}{{p\|\bm{v}_{nk}^{[i]}\|_{2}}+1}. (16)

As presented in [34], a smaller \|\bm{v}_{nk}^{[i]}\|_{2} yields a larger w_{nk}^{[i]}, which drives the nonzero components of \bm{v}_{nk} towards zero more aggressively. Overall, to enhance the group sparsity structure of the beamforming vector, the proposed proximal iteratively reweighted algorithm is summarized in Algorithm 1.

1:  Input: p>0p>0, β>0\beta>0, IterMax and 𝒗[0]\bm{v}^{[0]}.
2:  Initialization: 𝒘[0]=𝟏\bm{w}^{[0]}=\bm{1} and set i=0i=0.
3:  while not converged and IterMax not reached do
4:     (Solve the reweighted subproblem for 𝒗[i+1]\bm{v}^{[i+1]}) Calculate 𝒗[i+1]\bm{v}^{[i+1]} according to (15).
5:     (Reweighting) Update weight wnk[i]w_{nk}^{[i]} according to (16).
6:     Set ii+1i\leftarrow i+1.
7:  end while
8:  Output: 𝒗sparse=𝒗[i]\bm{v}_{\textrm{sparse}}=\bm{v}^{[i]}.
Algorithm 1 A Proximal Iteratively Reweighted Algorithm for Solving (13)
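A compact sketch of Algorithm 1 is given below, assuming the convex subproblem (15) is handed to a generic conic solver through cvxpy with complex variables; the problem sizes, data layout, and solver choice are illustrative assumptions, not part of the paper.

```python
import numpy as np
import cvxpy as cp

def constraint_set(V, H, gamma, sigma2, P_max, N, K, L):
    """Convex constraint set C of (10): per-BS power budgets and second-order-cone SINR constraints."""
    cons = [cp.sum_squares(V[n * L:(n + 1) * L, :]) <= P_max[n] for n in range(N)]
    for k in range(K):
        interference = [H[:, k].conj() @ V[:, l] for l in range(K) if l != k]
        cons.append(cp.norm(cp.hstack(interference + [np.sqrt(sigma2[k])]), 2)
                    <= cp.real(H[:, k].conj() @ V[:, k]) / np.sqrt(gamma[k]))
    return cons

def prox_irw(H, rho, gamma, sigma2, P_max, N, K, L, p=100.0, beta=0.1, max_iter=25, eps=1e-5):
    """Proximal iteratively reweighted algorithm (Algorithm 1) for the log-sum problem (13)."""
    w = np.ones((N, K))                                   # w^{[0]} = 1
    V_prev = np.zeros((N * L, K), dtype=complex)          # anchor for the proximal term; the analysis
                                                          # assumes a feasible v^{[0]}, zeros are a placeholder
    for _ in range(max_iter):
        V = cp.Variable((N * L, K), complex=True)
        group_norms = [[cp.norm(V[n * L:(n + 1) * L, k], 2) for k in range(K)] for n in range(N)]
        # Subproblem (15): reweighted group-norm term plus the proximal term
        obj = (sum(w[n, k] * group_norms[n][k] for n in range(N) for k in range(K))
               + (beta / 2) * cp.sum_squares(V - V_prev))
        cp.Problem(cp.Minimize(obj), constraint_set(V, H, gamma, sigma2, P_max, N, K, L)).solve()
        V_prev = V.value
        # Reweighting step (16): w_nk = p * rho_nk / (p * ||v_nk||_2 + 1)
        norms = np.array([[np.linalg.norm(V_prev[n * L:(n + 1) * L, k]) for k in range(K)]
                          for n in range(N)])
        w_new = p * rho / (p * norms + 1.0)
        if np.sum(np.abs(w_new - w)) <= eps:              # stopping rule (29)
            w = w_new
            break
        w = w_new
    return V_prev
```

In practice one may warm-start \bm{v}^{[0]} with any feasible beamformer (e.g., the solution of the mixed \ell_{1,2} relaxation), since the convergence analysis in Section IV assumes \bm{v}^{[0]}\in\mathcal{C}.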

Stage 2: Tasks Selection. In this second stage, an ordering guideline is applied to determine the priority of inference tasks, which is guided by the solution obtained in Stage 1. For ease of notation, let 𝒮={(n,k)n𝒩,k𝒦}\mathcal{S}=\{(n,k)\mid n\in\mathcal{N},k\in\mathcal{K}\} denote the set of all tasks. By considering the key system parameters (e.g., 𝒉nk\bm{h}_{nk}, Pnk𝖼P_{nk}^{\sf{c}} and ηn\eta_{n}), the priority of task (n,k)(n,k) is heuristically given as

θnk=𝒉nk22ηnPnk𝖼𝒗nk2.\theta_{nk}=\sqrt{\frac{\|\bm{h}_{nk}\|_{2}^{2}\eta_{n}}{{P}_{nk}^{\sf{c}}}}\|\bm{v}_{nk}\|_{2}. (17)

Intuitively, if edge BS n has a lower beamforming gain, lower power amplifier efficiency, and lower channel power gain, but a higher computation power consumption for MU k, then task (n,k) has a lower priority. A lower \theta_{nk} indicates that the task of MU k has lower priority and may not be performed by BS n. Thus, tasks are sorted according to (17) in descending order, i.e., \theta_{\pi_{1}}\geq\theta_{\pi_{2}}\geq\cdots\geq\theta_{\pi_{NK}}, where \pi denotes the permutation of task indices.

We then solve a sequence of convex feasibility detection problems to obtain task selection strategy 𝒜\mathcal{A},

find𝒗s.t.𝒗π(t)=𝟎,𝒗𝒞,\displaystyle\mathop{\textrm{find}}~{}\bm{v}\quad\textrm{s.t.}\quad\bm{v}_{\pi^{(t)}}=\bm{0},~{}\bm{v}\in\mathcal{C}, (18)

where \pi^{(t)}=\{\pi_{t+1},\cdots,\pi_{NK}\} and t increases from K to NK until (18) becomes feasible. Here \bm{v}_{\pi^{(t)}}=\bm{0} are convex constraints, meaning that all coefficients of \bm{v}_{nk} are zero for tasks (n,k)\in\pi^{(t)}. The support set of the beamformer \bm{v} is defined as \mathcal{T}(\bm{v})=\{(n,k)\mid\|\bm{v}_{nk}\|_{2}\neq 0,n\in\mathcal{N},k\in\mathcal{K}\}; the optimal task selection strategy \mathcal{A}^{*} can then be derived from \mathcal{T}(\bm{v})=\{\pi_{1},\dots,\pi_{t}\}.

Stage 3: Solution Refinement. At this point, we have determined the task selection for each BS. Then, fixing the obtained task index set, we solve the following convex program to refine the beamforming vectors

min𝒗\displaystyle\min_{\bm{v}} n=1Nk=1K1ηn𝒗nk22+n=1Nk𝒜Pnk𝖼\displaystyle\sum_{n=1}^{N}\sum_{k=1}^{K}\frac{1}{\eta_{n}}\|\bm{v}_{nk}\|_{2}^{2}+\sum_{n=1}^{N}\sum_{k\in\mathcal{A}^{*}}P_{nk}^{\sf{c}} (19)
s.t. 𝒗π(t)=𝟎,\displaystyle\bm{v}_{\pi^{(t)}}=\bm{0},
𝒗𝒞.\displaystyle\bm{v}\in\mathcal{C}.
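Stages 2 and 3 can be sketched as follows, reusing the constraint_set helper from the Algorithm 1 sketch above. For brevity, the sketch solves the refinement objective (19) directly at each candidate support, which subsumes the pure feasibility check (18); the data layout and helper names are the same illustrative assumptions as before.

```python
import numpy as np
import cvxpy as cp

def stages_two_three(V_sparse, H_local, H, P_c, eta, gamma, sigma2, P_max, N, K, L):
    """Stage 2 (task ordering (17) and feasibility search (18)) and Stage 3 (refinement (19)).

    V_sparse: Stage 1 solution, shape (N*L, K).  H_local[n, k]: per-link channel h_nk (length L).
    """
    # Priority (17): theta_nk = sqrt(||h_nk||^2 * eta_n / P_nk^c) * ||v_nk||_2, sorted descending
    theta = np.array([[np.sqrt(np.linalg.norm(H_local[n, k]) ** 2 * eta[n] / P_c[n, k])
                       * np.linalg.norm(V_sparse[n * L:(n + 1) * L, k])
                       for k in range(K)] for n in range(N)])
    order = [divmod(idx, K) for idx in np.argsort(-theta, axis=None)]   # [(n, k), ...] = pi_1, pi_2, ...

    for t in range(K, N * K + 1):                    # increase t until the support becomes feasible
        active, inactive = order[:t], order[t:]
        V = cp.Variable((N * L, K), complex=True)
        cons = constraint_set(V, H, gamma, sigma2, P_max, N, K, L)
        cons += [V[n * L:(n + 1) * L, k] == 0 for (n, k) in inactive]   # v_{pi^{(t)}} = 0
        # Objective (19): transmit power plus (constant) computation power of the selected tasks
        obj = (sum(cp.sum_squares(V[n * L:(n + 1) * L, k]) / eta[n]
                   for n in range(N) for k in range(K))
               + sum(P_c[n, k] for (n, k) in active))
        prob = cp.Problem(cp.Minimize(obj), cons)
        prob.solve()
        if prob.status in ("optimal", "optimal_inaccurate"):
            A = [[k for (m, k) in active if m == n] for n in range(N)]  # selection A* per BS
            return V.value, A
    return None, None
```

With these sketches, Algorithm 2 amounts to calling prox_irw for Stage 1 and then stages_two_three on its output.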

Overall, our proposed log-sum function based three-stage GSBF framework for solving (7) is summarized in Algorithm 2.

1:  Stage 1: Log-sum function minimization. Call Algorithm 1 to obtain 𝒗sparse\bm{v}_{\textrm{sparse}}.
2:  Stage 2: Tasks selection.
3:  Sort θnk\theta_{nk}’s in descending order by (17).
4:  Set t=Kt=K.
5:  while not feasible do
6:     (Determine the optimal task selection) Calculate (18) with π(t)={πt+1,,πNK}\pi^{(t)}=\{\pi_{t+1},\dots,\pi_{NK}\}.
7:     Set tt+1t\leftarrow t+1.
8:  end while
9:  Stage 3: Solution refinement.
10:  Refine {𝒗nk}\{\bm{v}_{nk}\} by solving (19).
11:  Obtain {𝒗nk}\{\bm{v}_{nk}\} and 𝒜\mathcal{A}^{*}.
Algorithm 2 A Log-sum Function Based Three-stage GSBF Framework for Solving (7)

IV Global Convergence Analysis

In this section, we provide the global convergence analysis for Algorithm 1. Specifically, we derive the first-order necessary optimality condition to characterize the optimal solutions. We then establish convergence results for a subsequence of the sequence generated by Algorithm 1. Furthermore, we show that for any initial feasible point, the entire sequence must have cluster points, and any cluster point satisfies the established first-order optimality condition. Finally, the ergodic worst-case convergence rate of the optimality residual is derived.

IV-A First-order Necessary Optimality Condition

In this subsection, we derive the first-order necessary conditions to characterize the optimal solution of (13). Problem (13) is equivalently rewritten as

min𝒗J(𝒗):=Ω(𝒗)+δ𝒞(𝒗).\min_{\bm{v}}~{}J(\bm{v}):=\Omega(\bm{v})+\delta_{\mathcal{C}}(\bm{v}). (20)

Similarly, for the derived subproblem (15), we have

min𝒗G(𝒗;𝒗[i]):=n=1Nk=1Kwnk[i]𝒗nk2+β2𝒗𝒗[i]22+δ𝒞(𝒗).\displaystyle\min_{\bm{v}}~{}\!G(\bm{v};\bm{v}^{[i]}):=\!\sum_{n=1}^{N}\!\sum_{k=1}^{K}\!w_{nk}^{[i]}\|\bm{v}_{nk}\|_{2}+\frac{\beta}{2}\|\bm{v}-\bm{v}^{[i]}\|_{2}^{2}\!+\!\delta_{\mathcal{C}}(\bm{v}). (21)

Due to the nonconvex and nonsmooth nature of the log-sum function, we make use of the Fréchet subdifferential as the major tool in our analysis. Its definition is introduced as follows.

Definition 1 (Fréchet subdifferential [26])

Let \mathcal{X} be a real Banach space, let \mathcal{X}^{*} denote the corresponding topological dual, and let f be a function from \mathcal{X} into the extended real line \bar{\mathbb{R}}=\mathbb{R}\cup\{+\infty\}, finite at \bm{r}. The set

Ff(𝒓)={𝒓𝒳|lim inf𝒖𝒓f(𝒖)f(𝒓)𝒓,𝒖𝒓𝒖𝒓20}\partial_{F}f(\bm{r})\!=\!\left\{\bm{r}^{*}\!\in\!\mathcal{X}^{*}|\liminf_{\bm{u}\to\bm{r}}\frac{f(\bm{u})-f(\bm{r})\!-\!\langle\bm{r}^{*},\bm{u}-\bm{r}\rangle}{\|\bm{u}\!-\!\bm{r}\|_{2}}\geq 0\right\}

is called a Fréchet subdifferential of ff at 𝐫\bm{r}. Its elements are referred to as Fréchet subgradients.

Several important properties of the Fréchet subdifferential [26] are listed below, which are used to characterize the optimal solution of (13).

Proposition 1

Let X be a closed and convex set. Then the following properties of Fréchet subdifferentials hold true.

  • (i)(i)

    If f is Fréchet subdifferentiable at \bm{x} and attains a local minimum at \bm{x}, then \bm{0}\in\partial_{F}f(\bm{x}).

  • (ii)(ii)

    Let h()h(\cdot) be Fréchet subdifferentiable at z¯=z(𝒙¯)\bar{z}=z(\bar{\bm{x}}) with z(𝒙)z(\bm{x}) being convex, then hz(𝒙)h\circ z(\bm{x}) is Fréchet subdifferentiable at 𝒙¯\bar{\bm{x}} such that

    y¯z(𝒙¯)Fhz(𝒙¯)\bar{y}\partial z(\bar{\bm{x}})\subset\partial_{F}h\circ z(\bar{\bm{x}})

    for any y¯Fh(z¯)\bar{y}\in\partial_{F}h(\bar{z}).

  • (iii)(iii)

    𝒩X(𝒙)=FδX(𝒙)\mathcal{N}_{X}(\bm{x})=\partial_{F}\delta_{X}(\bm{x}) with closed and convex sets XX.

The following Fermat’s rule [47] describes the necessary optimality condition of problem (13).

Theorem 1 (Fermat’s rule)

If (20) attains a local minimum at 𝐯\bm{v}, then it holds true that

𝟎FJ(𝒗):=FΩ(𝒗)+𝒩𝒞(𝒗).\bm{0}\in\partial_{F}J(\bm{v}):=\partial_{F}\Omega(\bm{v})+\mathcal{N}_{\mathcal{C}}(\bm{v}). (22)

We next investigate the properties of \partial_{F}\Omega(\bm{v}) in the following Proposition 2, which indicates that the Fréchet subdifferential of f_{p}(\cdot) at \bm{0} is bounded.

Proposition 2

If \bm{v}_{nk}\neq\bm{0}, then \gamma\partial\bm{z}(\bm{v}_{nk})\subset\partial_{F}f_{p}\circ\bm{z}(\bm{v}_{nk}) for any \gamma\in\partial_{F}f_{p}(\bm{z}). In particular, every element of \partial_{F}f_{p}\circ\bm{z}(\bm{0}) belongs to the set \{\bm{y}\in\mathbb{C}^{L}\mid\|\bm{y}\|_{2}\leq p\}.

To explore the behavior of the proposed proximal iteratively reweighted algorithm, based on Theorem 1 and Proposition 2, we define the optimality residual associated with (20) at a point 𝒗[i]\bm{v}^{[i]} as

𝒓[i]:=𝒘[i]𝒙[i]+𝒖[i],\bm{r}^{[i]}:=\bm{w}^{[i]}\odot\bm{x}^{[i]}+\bm{u}^{[i]}, (23)

where 𝒙[i]𝒗[i]𝒢2=[(𝒗11[i]2)𝖳,,(𝒗NK[i]2)𝖳]𝖳\bm{x}^{[i]}\in\partial\|\bm{v}^{[i]}\|_{\mathcal{G}_{2}}=\left[(\partial\|\bm{v}_{11}^{[i]}\|_{2})^{\sf{T}},\cdots,(\partial\|\bm{v}_{NK}^{[i]}\|_{2})^{\sf{T}}\right]^{\sf{T}} and 𝒖[i]𝒩𝒞(𝒗[i])\bm{u}^{[i]}\in\mathcal{N}_{\mathcal{C}}(\bm{v}^{[i]}). Since 𝒓[i]FJ(𝒗[i])\bm{r}^{[i]}\in\partial_{F}J(\bm{v}^{[i]}), it implies that if 𝒓[i]=𝟎\bm{r}^{[i]}=\bm{0} then 𝒗[i]\bm{v}^{[i]} satisfies the first-order necessary optimality condition (22). We adopt 𝒓[i]\bm{r}^{[i]} to measure the convergence rate of our algorithm.

Moreover, we provide the first-order optimality condition of the subproblem (21) as follows

𝟎=G(𝒗;𝒗[i])=β(𝒗[i+1]𝒗[i])+𝒘[i]𝒙[i+1]+𝒖[i+1],\displaystyle\bm{0}=\partial G(\bm{v};\bm{v}^{[i]})=\beta(\bm{v}^{[i+1]}-\bm{v}^{[i]})+\bm{w}^{[i]}\odot\bm{x}^{[i+1]}+\bm{u}^{[i+1]}, (24)

where \bm{v}^{[i+1]}=\operatornamewithlimits{arg\,min}_{\bm{v}}G(\bm{v};\bm{v}^{[i]}), \bm{x}^{[i+1]}\in\partial\|\bm{v}^{[i+1]}\|_{\mathcal{G}_{2}} and \bm{u}^{[i+1]}\in\mathcal{N}_{\mathcal{C}}(\bm{v}^{[i+1]}). Note that the existence of an optimal solution to (21) simply follows from the convexity and coercivity of the objective G(\cdot;\bm{v}^{[i]}).

Now we show that an optimal solution of (21) also satisfies the first-order necessary optimality condition of (20) in the following lemma.

Lemma 1

𝒗[i]\bm{v}^{[i]} satisfies the first-order necessary optimality condition of (20) if and only if

𝒗[i]=argmin𝒗G(𝒗;𝒗[i]).\bm{v}^{[i]}=\operatornamewithlimits{arg\,min}_{\bm{v}}G(\bm{v};\bm{v}^{[i]}).
Proof:

Please refer to Appendix A for details. ∎

Define the model reduction caused by 𝒗[i+1]\bm{v}^{[i+1]} at a point 𝒗[i]\bm{v}^{[i]} as

ΔG(𝒗[i+1];𝒗[i]):=G(𝒗[i];𝒗[i])G(𝒗[i+1];𝒗[i]).\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]}):=G(\bm{v}^{[i]};\bm{v}^{[i]})-G(\bm{v}^{[i+1]};\bm{v}^{[i]}). (25)

The new iterate \bm{v}^{[i+1]} causes a decrease in the objective J(\bm{v}), and the model reduction (25) converges to zero in the limit; both results are established in the following Lemma 2.

Lemma 2

Suppose \{\bm{v}^{[i]}\}_{i=0}^{\infty} is generated by Algorithm 1 with \bm{v}^{[0]}\in\mathcal{C}. The following statements hold true.

  • (i)(i)

    J(𝒗[i+1])J(𝒗[i])G(𝒗[i+1];𝒗[i])G(𝒗[i];𝒗[i])0J(\bm{v}^{[i+1]})-J(\bm{v}^{[i]})\leq G(\bm{v}^{[i+1]};\bm{v}^{[i]})-G(\bm{v}^{[i]};\bm{v}^{[i]})\leq 0.

  • (ii)(ii)

    \lim\limits_{i\to\infty}\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})=0.

  • (iii)(iii)

    G(𝒗;𝒗[i])G(\bm{v};\bm{v}^{[i]}) is monotonically decreasing. Indeed,

    ΔG(𝒗[i+1];𝒗[i])β2𝒗[i]𝒗[i+1]22.\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})\geq\frac{\beta}{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}.
Proof:

Please refer to Appendix B for details. ∎

We now provide our main result in the following Theorem 2.

Theorem 2

Suppose {𝐯[i]}i=0\{\bm{v}^{[i]}\}_{i=0}^{\infty} is generated by Algorithm 1 with 𝐯[0]𝒞\bm{v}^{[0]}\in\mathcal{C}. It holds true that {𝐯[i]}\{\bm{v}^{[i]}\} must be bounded and any cluster point of {𝐯[i]}\{\bm{v}^{[i]}\} satisfies the first-order necessary optimality condition of (20).

Proof:

Please refer to Appendix C for details. ∎

IV-B Ergodic Worst-case Convergence Rate

In this subsection, we show that the presented proximal iteratively reweighted algorithm has an O(1/t) ergodic worst-case convergence rate in terms of the optimality residual. The following lemma states that the optimality residual is upper bounded in terms of the displacement of the iterates.

Lemma 3

The optimality residual associated with problem (20) satisfies

𝒓[i+1]22(β2+2βκp2+κ2p4)𝒗[i]𝒗[i+1]22\|\bm{r}^{[i+1]}\|_{2}^{2}\leq(\beta^{2}+2\beta\kappa p^{2}+\kappa^{2}p^{4})\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}

with \kappa=\rho^{\max}, where \rho^{\max} denotes the maximum element among \left[\rho_{nk}\right] over all n\in\mathcal{N}, k\in\mathcal{K}.

Proof:

Please refer to Appendix D for details. ∎

The subproblem (21) is referred to as the primal problem, and by exploiting the conjugate function [47], the associated Fenchel-Rockafellar dual is constructed as

max𝝀,𝝁\displaystyle\max_{\bm{\lambda},\bm{\mu}} Q(𝝀,𝝁;𝒗[i])\displaystyle Q(\bm{\lambda},\bm{\mu};\bm{v}^{[i]}) (26)
s.t. 𝝀nk21,n𝒩,k𝒦,\displaystyle\|\bm{\lambda}_{nk}\|_{2}\leq 1,\quad\forall n\in\mathcal{N},k\in\mathcal{K},

where the dual objective is given as Q(\bm{\lambda},\bm{\mu};\bm{v}^{[i]})=-\frac{1}{2\beta}\|\bm{\lambda}\odot\bm{w}^{[i]}+\bm{\mu}-\beta\bm{v}^{[i]}\|_{2}^{2}+\frac{\beta}{2}\|\bm{v}^{[i]}\|_{2}^{2}-\delta_{\mathcal{C}}^{*}(\bm{\mu}), and the technical details for constructing (26) are provided in Appendix E.

The Fenchel-Rockafellar duality theorem [47] states that the optimal value of (26) provides a lower bound on the optimal value of (21). Moreover, the gap between the primal objective value of (21) and the corresponding dual objective value of (26) at the i-th iterate is defined as

g(𝒗,𝝀,𝝁;𝒗[i]):=G(𝒗;𝒗[i])Q(𝝀,𝝁;𝒗[i]).g(\bm{v},\bm{\lambda},\bm{\mu};\bm{v}^{[i]}):=G(\bm{v};\bm{v}^{[i]})-Q(\bm{\lambda},\bm{\mu};\bm{v}^{[i]}). (27)

If this gap g(𝒗,𝝀,𝝁;𝒗[i])g(\bm{v},\bm{\lambda},\bm{\mu};\bm{v}^{[i]}) is zero, then the strong duality holds. That is, at the optimal solution (𝒗[i+1];𝝀[i+1],𝝁[i+1])(\bm{v}^{[i+1]};\bm{\lambda}^{[i+1]},\bm{\mu}^{[i+1]}), we have

ΔG(𝒗[i+1];𝒗[i])=g(𝒗[i+1],𝝀[i+1],𝝁[i+1];𝒗[i]).\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})=g(\bm{v}^{[i+1]},\bm{\lambda}^{[i+1]},\bm{\mu}^{[i+1]};\bm{v}^{[i]}). (28)

We now show that the duality gap vanishes asymptotically in the following theorem.

Theorem 3

Let \{\bm{v}^{[i]}\}_{i=0}^{\infty} be the sequence generated by Algorithm 1 with \bm{v}^{[0]}\in\mathcal{C}. Then \|\bm{r}^{[i+1]}\|_{2}^{2} has an O(1/t) ergodic worst-case convergence rate.

Proof:

Please refer to Appendix F for details. ∎

V Numerical Experiments

In this section, we use numerical experiments to validate the effectiveness of our proposed algorithms and to illustrate the presented theoretical results. We compare the log-sum function based three-stage GSBF approach with the coordinated beamforming approach (CB) [48] and the mixed \ell_{1,2} GSBF approach [29] (Mixed \ell_{1,2} GSBF). These two baselines are described below:

  • CB considers minimizing the total transmit power consumption. In other words, all BSs are required to perform the inference tasks from all MUs.

  • Mixed \ell_{1,2} GSBF adopts the mixed \ell_{1,2}-norm (i.e., the objective function in (13) is replaced with \sum_{n=1}^{N}\sum_{k=1}^{K}\rho_{nk}\|\bm{v}_{nk}\|_{2}) to induce group sparsity of the beamforming vector in Stage 1 of Algorithm 2.

In the experimental set-up, we consider an edge AI inference system with 8 2-antenna BSs and 15 single-antenna MUs, all uniformly and independently distributed in a [-0.5,0.5] km \times [-0.5,0.5] km square region. The channel between BS n and MU k is set as \bm{h}_{nk}=10^{-L(d_{nk})/20}\bm{\xi}_{nk}, where the path-loss model is given by L(d_{nk})=128.1+37.6\log_{10}d_{nk}, d_{nk} is the Euclidean distance between BS n and MU k, and \bm{\xi}_{nk}\sim\mathcal{CN}(\bm{0},\bm{I}) is the small-scale fading coefficient. We set P^{\sf{c}}_{nk}=0.45 W and specify P_{n}^{\max}=1 W, \eta_{n}=25\% and \sigma_{k}^{2}=1. Furthermore, for the proposed log-sum function based three-stage GSBF approach, we set p=100, \beta=0.1 and initialize \bm{w}^{[0]}=\bm{1}. In particular, we terminate the proximal iteratively reweighted algorithm either when it reaches the predefined maximum number of iterations \textrm{IterMax}=25 or when it satisfies

𝒘[i+1]𝒘[i]1ϵ,\|\bm{w}^{[i+1]}-\bm{w}^{[i]}\|_{1}\leq\epsilon, (29)

where \epsilon=10^{-5} is a prescribed tolerance.
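A sketch of the channel and parameter generation used above might look as follows (numpy only; the random seed and the uniform placement routine are implementation assumptions).

```python
import numpy as np

rng = np.random.default_rng(2019)
N, L, K = 8, 2, 15                                   # 8 BSs with 2 antennas each, 15 single-antenna MUs

# Uniform, independent placement of BSs and MUs in the [-0.5, 0.5] km x [-0.5, 0.5] km region
bs_pos = rng.uniform(-0.5, 0.5, size=(N, 2))
mu_pos = rng.uniform(-0.5, 0.5, size=(K, 2))
d = np.linalg.norm(bs_pos[:, None, :] - mu_pos[None, :, :], axis=-1)   # distances d_nk in km

# Channel h_nk = 10^{-L(d_nk)/20} * xi_nk with L(d) = 128.1 + 37.6 log10(d) and xi_nk ~ CN(0, I)
path_loss_dB = 128.1 + 37.6 * np.log10(d)
xi = (rng.standard_normal((N, K, L)) + 1j * rng.standard_normal((N, K, L))) / np.sqrt(2)
H_local = 10 ** (-path_loss_dB[..., None] / 20) * xi                    # shape (N, K, L)
H = H_local.transpose(0, 2, 1).reshape(N * L, K)                        # aggregated h_k as columns

# System parameters used in Section V
P_c = np.full((N, K), 0.45)          # computation power P_nk^c [W]
P_max = np.ones(N)                   # per-BS transmit power budget [W]
eta = np.full(N, 0.25)               # power amplifier efficiency
sigma2 = np.ones(K)                  # noise variance
rho = np.sqrt(P_c / eta[:, None])    # weights rho_nk = sqrt(P_nk^c / eta_n)
gamma = np.full(K, 10 ** (4 / 10))   # e.g., a 4 dB target SINR for all MUs
```

These arrays can then be passed to the prox_irw and stages_two_three sketches given earlier.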

V-A Convergence of the Proximal Iteratively Reweighted Algorithm

The goal of this subsection is to illustrate the convergence behavior of the proposed proximal iteratively reweighted algorithm. The presented result is obtained for a typical channel realization. Fig. 3 illustrates the convergence of the proximal iteratively reweighted algorithm. We can see that \Omega(\bm{v}) steadily decreases along the iterations, which is consistent with our analysis in Lemma 2. Interestingly, we observe that the objective value \Omega(\bm{v}) drops quickly in the first few iterations (fewer than 5 iterations), which indicates that the proposed proximal iteratively reweighted algorithm converges very fast. In view of this, we may suggest terminating Algorithm 1 early in practice to obtain an approximate solution, thereby speeding up the entire framework while maintaining the overall performance.

Figure 3: Convergence of the proximal iteratively reweighted algorithm for log-sum minimization problem.

V-B Effectiveness of the Proposed Approach

We evaluate the performance of the three algorithms in terms of the overall network power consumption, the transmit power consumption and the number of computation tasks. The presented results are averaged over 100 randomly and independently generated channel realizations.

Fig. 4 depicts the overall network power consumption of the three approaches under different target SINRs. First, we observe that all three approaches have higher total power consumption as the required SINR becomes more stringent. This is because more edge BSs are required to transmit the inference results to meet the higher QoS. In addition, we can see that the CB approach has the highest power consumption among the three approaches, and the relative power difference between CB and the other two approaches reaches approximately 67% when the target SINR is 2 dB and approximately 18% when it is 8 dB, indicating the effectiveness of the joint task selection and group sparse beamforming approach in minimizing the overall network power consumption. On the other hand, we can see that the proposed log-sum function based three-stage GSBF approach outperforms the mixed \ell_{1,2} GSBF approach, which demonstrates that enhancing the group sparsity further reduces the overall network power consumption. In particular, we also observe that the performance gap between the blue and the red curves remains at approximately 9% when the target SINR ranges from 4 dB to 8 dB, which indicates that the proposed log-sum function based three-stage GSBF approach is still attractive in the high SINR regime.

Tables I and II further report the number of inference tasks performed by the edge BSs and the transmission power consumption, respectively. To be specific, in Table I, we observe that the number of performed inference tasks differs among the three approaches under various SINRs, which reflects the role of the task selection strategy. Besides, it is observed that the log-sum function based three-stage GSBF approach always performs fewer inference tasks than the mixed \ell_{1,2} GSBF approach for all target SINRs, which indicates that the log-sum function based three-stage GSBF approach enhances the group sparsity pattern of the beamforming vector. Meanwhile, as observed in Table II, the CB approach has the lowest transmission power among the three approaches because it only optimizes the transmission power while performing all inference tasks. On the other hand, the transmission power consumption of the log-sum function based three-stage GSBF approach is slightly higher than that of the mixed \ell_{1,2} GSBF approach under most SINRs. This is because more edge BSs participate in performing inference tasks in the mixed \ell_{1,2} GSBF approach, resulting in a higher transmit beamforming gain for reducing the transmission power. In other words, performing fewer inference tasks further reduces the computation power consumption of the edge BSs but increases the transmission power consumption. Taking Fig. 4 and Tables I-II together, we see that the proposed joint task selection and GSBF approach strikes a good balance between computation and transmission power consumption, yielding the lowest overall network power consumption.

Figure 4: Average total network power consumption comparison for three different approaches in edge AI inference system.
TABLE I: The Number of Performed Inference Tasks with Different Approaches
Target SINR [dB] | Proposed | Mixed \ell_{1,2} GSBF | CB
0 | 16.49 | 17.79 | 120.00
2 | 21.00 | 22.50 | 120.00
4 | 30.08 | 34.47 | 120.00
6 | 43.57 | 52.17 | 120.00
8 | 67.76 | 77.36 | 120.00
TABLE II: The Total Transmit Power Consumption with Different Approaches
Target SINR [dB] | Proposed | Mixed \ell_{1,2} GSBF | CB
0 | 4.16 | 4.68 | 0.41
2 | 9.12 | 8.31 | 0.88
4 | 12.24 | 11.76 | 1.99
6 | 15.81 | 14.70 | 4.57
8 | 18.81 | 18.58 | 10.79

VI Conclusion

In this paper, we developed an energy-efficient edge AI inference system through the joint selection of inference tasks and optimization of the transmit beamforming vectors, which respectively determine the computation power consumption and the downlink transmission power consumption. Based on the critical insight that inference task selection can be achieved by controlling the group sparsity structure of the transmit beamforming vectors, we developed a group sparse optimization framework for network power minimization, for which a log-sum function based three-stage group sparse beamforming algorithm was developed to enhance the group sparsity of the solutions. To resolve the resulting nonconvex and nonsmooth log-sum function minimization problem, we further proposed a proximal iteratively reweighted algorithm. Furthermore, a global convergence analysis was provided, and a worst-case O(1/t) convergence rate in an ergodic sense was derived for this algorithm.

Appendix A Proof of Lemma 1

Let $\bm{x}^{[i]}\in\partial\|\bm{v}^{[i]}\|_{\mathcal{G}_{2}}$ and $\bm{u}^{[i]}\in\mathcal{N}_{\mathcal{C}}(\bm{v}^{[i]})$. If $\bm{v}^{[i]}=\operatornamewithlimits{arg\,min}_{\bm{v}}G(\bm{v};\bm{v}^{[i]})$, then by (24) we have

\bm{0}=\bm{w}^{[i]}\odot\bm{x}^{[i]}+\bm{u}^{[i]}. (30)

We conclude that $\bm{v}^{[i]}$ satisfies (22), indicating that $\bm{v}^{[i]}$ is first-order optimal for (20).

Conversely, if $\bm{v}^{[i]}$ satisfies (22), then $\bm{v}^{[i]}$ satisfies (30) by Proposition 2. Thus $\bm{v}^{[i]}$ must be the optimal solution to the subproblem (15). This completes the proof.

Appendix B Proof of Lemma 2

First of all, $\bm{v}^{[i+1]}=\operatornamewithlimits{arg\,min}_{\bm{v}}G(\bm{v};\bm{v}^{[i]})$ and $G(\bm{v};\bm{v}^{[i]})$ is convex, so that $G(\bm{v}^{[i+1]};\bm{v}^{[i]})-G(\bm{v}^{[i]};\bm{v}^{[i]})\leq 0$. Since $f_{p}(\cdot)$ is concave, we have

f_{p}(\bm{z})\leq f_{p}(\bm{z}_{0})+\langle\nabla f_{p}(\bm{z}_{0}),\bm{z}-\bm{z}_{0}\rangle,~\forall\bm{z},\bm{z}_{0}\in\mathbb{R}_{+}^{L}. (31)

Therefore

\begin{aligned} J(\bm{v}^{[i+1]})&=\Omega(\bm{v}^{[i+1]})\\ &\leq\Omega(\bm{v}^{[i]})+\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}(\|\bm{v}_{nk}^{[i+1]}\|_{2}-\|\bm{v}_{nk}^{[i]}\|_{2})\\ &\quad+\frac{\beta}{2}\|\bm{v}^{[i+1]}-\bm{v}^{[i]}\|_{2}^{2}\\ &=J(\bm{v}^{[i]})+G(\bm{v}^{[i+1]};\bm{v}^{[i]})-G(\bm{v}^{[i]};\bm{v}^{[i]}),\end{aligned} (32)

where the first inequality follows from (31). This completes the proof of the first statement $(i)$.

On the other hand, by (32), we have

\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})\leq J(\bm{v}^{[i]})-J(\bm{v}^{[i+1]}). (33)

Summing both sides of (33) over $i=0,\ldots,t$ yields

\begin{aligned} 0\leq\sum_{i=0}^{t}\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})&\leq J(\bm{v}^{[0]})-J(\bm{v}^{[t+1]})\\ &\leq J(\bm{v}^{[0]})-\tilde{J},\end{aligned} (34)

where $\tilde{J}$ is a lower bound of $J(\cdot)$. Since every summand is nonnegative, letting $t\to\infty$ shows the series converges, and hence

\lim_{i\to\infty}\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})=0. (35)

This completes the proof of the second statement $(ii)$.

For the last statement $(iii)$, we follow the proof technique presented in [37]. By (24), we have

\bm{0}=\beta(\bm{v}^{[i+1]}-\bm{v}^{[i]})+\widetilde{\bm{x}}^{[i+1]}+\bm{u}^{[i+1]}, (36)

where $\widetilde{\bm{x}}^{[i+1]}\in\partial\langle\bm{w}^{[i]},\|\bm{v}^{[i+1]}\|_{\mathcal{G}_{2}}\rangle$ and $\bm{u}^{[i+1]}\in\mathcal{N}_{\mathcal{C}}(\bm{v}^{[i+1]})$. Taking the inner product of both sides of (36) with $(\bm{v}^{[i+1]}-\bm{v}^{[i]})$ yields

\begin{aligned} \bm{0}=&\beta\|\bm{v}^{[i+1]}-\bm{v}^{[i]}\|_{2}^{2}+\langle\widetilde{\bm{x}}^{[i+1]},\bm{v}^{[i+1]}-\bm{v}^{[i]}\rangle\\ &+\langle\bm{u}^{[i+1]},\bm{v}^{[i+1]}-\bm{v}^{[i]}\rangle.\end{aligned} (37)

By the definition of the subgradient of a convex function, we have

\begin{aligned} \langle\bm{w}^{[i]},\|\bm{v}^{[i]}\|_{\mathcal{G}_{2}}-\|\bm{v}^{[i+1]}\|_{\mathcal{G}_{2}}\rangle&\geq\langle\widetilde{\bm{x}}^{[i+1]},\bm{v}^{[i]}-\bm{v}^{[i+1]}\rangle\\ &=\langle\bm{u}^{[i+1]},\bm{v}^{[i+1]}-\bm{v}^{[i]}\rangle\\ &\quad+\beta\|\bm{v}^{[i+1]}-\bm{v}^{[i]}\|_{2}^{2}.\end{aligned} (38)

Therefore

\begin{aligned} &\quad\,\,G(\bm{v}^{[i]};\bm{v}^{[i]})-G(\bm{v}^{[i+1]};\bm{v}^{[i]})\\ &=\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}(\|\bm{v}_{nk}^{[i]}\|_{2}-\|\bm{v}_{nk}^{[i+1]}\|_{2})-\frac{\beta}{2}\|\bm{v}^{[i+1]}-\bm{v}^{[i]}\|_{2}^{2}\\ &=\langle\bm{w}^{[i]},\|\bm{v}^{[i]}\|_{\mathcal{G}_{2}}-\|\bm{v}^{[i+1]}\|_{\mathcal{G}_{2}}\rangle-\frac{\beta}{2}\|\bm{v}^{[i+1]}-\bm{v}^{[i]}\|_{2}^{2}\\ &\geq\langle\bm{u}^{[i+1]},\bm{v}^{[i+1]}-\bm{v}^{[i]}\rangle+\frac{\beta}{2}\|\bm{v}^{[i+1]}-\bm{v}^{[i]}\|_{2}^{2}\\ &=\frac{\beta}{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}-\langle\bm{u}^{[i+1]},\bm{v}^{[i]}-\bm{v}^{[i+1]}\rangle\\ &\geq\frac{\beta}{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2},\end{aligned} (39)

where the first inequality follows from (38) and the last inequality holds since $\bm{u}^{[i+1]}\in\mathcal{N}_{\mathcal{C}}(\bm{v}^{[i+1]})$. This completes the proof.
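
To complement the proof, the following toy numerical check illustrates statements $(i)$ and $(iii)$ of Lemma 2. It is a minimal sketch under simplifying assumptions not made in the paper: the constraint set $\mathcal{C}$ is dropped, so the subproblem reduces to block soft-thresholding, and the log-sum surrogate is instantiated as $f_{p}(\bm{z})=\sum_{g}\log(1+z_{g}/p)$ (one common choice; the paper's exact parameterization may differ). The group sizes, $p$, and $\beta$ are arbitrary.

import numpy as np

# Toy check of Lemma 2 (sketch only): unconstrained variant with an assumed
# log-sum surrogate, so the subproblem has a closed-form block soft-threshold.
rng = np.random.default_rng(0)
p, beta = 0.1, 2.0
groups = [rng.standard_normal(4) for _ in range(6)]   # v split into 6 groups

def J(gs):                       # objective: log-sum of group norms
    return sum(np.log(1.0 + np.linalg.norm(g) / p) for g in gs)

def G(gs, gs_i, w):              # majorizing subproblem objective G(v; v^[i])
    lin = sum(wg * np.linalg.norm(g) for wg, g in zip(w, gs))
    prox = 0.5 * beta * sum(np.linalg.norm(g - gi) ** 2 for g, gi in zip(gs, gs_i))
    return lin + prox

v = [g.copy() for g in groups]
for it in range(10):
    w = [1.0 / (p + np.linalg.norm(g)) for g in v]     # reweighting step
    v_next = []
    for wg, g in zip(w, v):                            # block soft-thresholding
        norm_g = np.linalg.norm(g)
        shrink = max(0.0, 1.0 - wg / (beta * norm_g)) if norm_g > 0 else 0.0
        v_next.append(shrink * g)
    decrease = G(v, v, w) - G(v_next, v, w)
    dist2 = sum(np.linalg.norm(a - b) ** 2 for a, b in zip(v, v_next))
    # Lemma 2(i): monotone descent; Lemma 2(iii): sufficient decrease of G.
    assert J(v_next) <= J(v) + 1e-12
    assert decrease >= 0.5 * beta * dist2 - 1e-12
    v = v_next
print("Lemma 2 descent and sufficient-decrease checks passed on the toy problem.")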

Appendix C Proof of Theorem 2

By Lemma 2, the sequence $\{J(\bm{v}^{[i]})\}$ is monotonically decreasing. Since $J(\cdot)$ is coercive, the sequence $\{\bm{v}^{[i]}\}$ is bounded and therefore has cluster points. Let $\bm{v}^{*}$ be a cluster point of $\{\bm{v}^{[i]}\}$. From Lemma 1, it suffices to show that $\bm{v}^{*}=\operatornamewithlimits{arg\,min}_{\bm{v}}G(\bm{v};\bm{v}^{*})$. We prove this by contradiction. Assume that there exists a point $\tilde{\bm{v}}$ such that $\epsilon:=G(\bm{v}^{*};\bm{v}^{*})-G(\tilde{\bm{v}};\bm{v}^{*})>0$. Suppose $\{\bm{v}^{[i]}\}_{\mathcal{S}}\to\bm{v}^{*}$ with $\mathcal{S}\subset\mathbb{N}$. Based on Lemma 2, there must exist $i_{1}>0$ such that for all $i>i_{1}$, $i\in\mathcal{S}$,

G(\bm{v}^{[i]};\bm{v}^{[i]})-G(\bm{v}^{[i+1]};\bm{v}^{[i]})\leq\epsilon/4. (40)

Since $\bm{v}_{nk}^{[i]}\stackrel{\mathcal{S}}{\to}\bm{v}_{nk}^{*}$ and $w_{nk}^{[i]}\stackrel{\mathcal{S}}{\to}w_{nk}^{*}$, there exists $i_{2}$ such that for all $i>i_{2}$, $i\in\mathcal{S}$, $\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}\|\bm{v}_{nk}^{[i]}\|_{2}-w_{nk}^{*}\|\bm{v}_{nk}^{*}\|_{2}>-\epsilon/10$ and $\frac{\beta}{2}\|\bm{v}^{*}-\tilde{\bm{v}}\|_{2}^{2}+\sum_{n,k}(w_{nk}^{*}-w_{nk}^{[i]})\|\tilde{\bm{v}}_{nk}\|_{2}-\frac{\beta}{2}\|\tilde{\bm{v}}-\bm{v}^{[i]}\|_{2}^{2}>-\epsilon/10$.

Therefore, for all $i>i_{2}$, $i\in\mathcal{S}$,

\begin{aligned} &\quad\,\,G(\bm{v}^{*};\bm{v}^{*})-G(\tilde{\bm{v}};\bm{v}^{[i]})\\ &=\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{*}\|\bm{v}_{nk}^{*}\|_{2}-\frac{\beta}{2}\|\tilde{\bm{v}}-\bm{v}^{[i]}\|_{2}^{2}\\ &\quad-\sum_{n=1}^{N}\sum_{k=1}^{K}(w_{nk}^{*}-(w_{nk}^{*}-w_{nk}^{[i]}))\|\tilde{\bm{v}}_{nk}\|_{2}\\ &=\left[G(\bm{v}^{*};\bm{v}^{*})-G(\tilde{\bm{v}};\bm{v}^{*})\right]+\frac{\beta}{2}\|\tilde{\bm{v}}-\bm{v}^{*}\|_{2}^{2}\\ &\quad+\sum_{n=1}^{N}\sum_{k=1}^{K}(w_{nk}^{*}-w_{nk}^{[i]})\|\tilde{\bm{v}}_{nk}\|_{2}-\frac{\beta}{2}\|\tilde{\bm{v}}-\bm{v}^{[i]}\|_{2}^{2}\\ &\geq\epsilon-\epsilon/10=9\epsilon/10,\end{aligned} (41)

and that

\begin{aligned} &\quad\,\,G(\bm{v}^{[i]};\bm{v}^{[i]})-G(\bm{v}^{*};\bm{v}^{*})\\ &=\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}\|\bm{v}_{nk}^{[i]}\|_{2}-\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{*}\|\bm{v}_{nk}^{*}\|_{2}>-\epsilon/10.\end{aligned} (42)

Hence, for all $i>\max(i_{1},i_{2})$, $i\in\mathcal{S}$, it holds that

\begin{aligned} &\quad\,\,G(\bm{v}^{[i]};\bm{v}^{[i]})-G(\tilde{\bm{v}};\bm{v}^{[i]})\\ &=G(\bm{v}^{[i]};\bm{v}^{[i]})-G(\bm{v}^{*};\bm{v}^{*})+G(\bm{v}^{*};\bm{v}^{*})-G(\tilde{\bm{v}};\bm{v}^{[i]})\\ &\geq-\epsilon/10+9\epsilon/10=4\epsilon/5,\end{aligned} (43)

contradicting (40). Hence, $\bm{v}^{*}=\operatornamewithlimits{arg\,min}_{\bm{v}}G(\bm{v};\bm{v}^{*})$. By Lemma 1, $\bm{v}^{*}$ satisfies the first-order necessary optimality condition for (20). This completes the proof.

Appendix D Proof of Lemma 3

Recall the first-order necessary optimality condition (24) of subproblem (15). Rearranging terms, we have

\bm{v}^{[i]}-\bm{v}^{[i+1]}=\frac{1}{\beta}(\bm{w}^{[i]}\odot\bm{x}^{[i+1]}+\bm{u}^{[i+1]}), (44)

where $\bm{x}^{[i+1]}\in\partial\|\bm{v}^{[i+1]}\|_{\mathcal{G}_{2}}$ and $\bm{u}^{[i+1]}\in\mathcal{N}_{\mathcal{C}}(\bm{v}^{[i+1]})$. Squaring both sides of (44), we have

\begin{aligned} &\quad\,\,\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}\\ &=\frac{1}{\beta^{2}}\|\bm{w}^{[i]}\odot\bm{x}^{[i+1]}+\bm{u}^{[i+1]}\|_{2}^{2}\\ &=\frac{1}{\beta^{2}}\|(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]}+\bm{r}^{[i+1]}\|_{2}^{2}\\ &=\frac{1}{\beta^{2}}\|(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]}\|_{2}^{2}+\frac{1}{\beta^{2}}\|\bm{r}^{[i+1]}\|_{2}^{2}\\ &\quad+\frac{2}{\beta^{2}}\langle(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]},\bm{r}^{[i+1]}\rangle\\ &=\frac{1}{\beta^{2}}\|(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]}\|_{2}^{2}+\frac{1}{\beta^{2}}\|\bm{r}^{[i+1]}\|_{2}^{2}\\ &\quad-\frac{2}{\beta^{2}}\langle(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]},\\ &\quad(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]}+\beta(\bm{v}^{[i+1]}-\bm{v}^{[i]})\rangle\\ &=-\frac{1}{\beta^{2}}\|(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]}\|_{2}^{2}+\frac{1}{\beta^{2}}\|\bm{r}^{[i+1]}\|_{2}^{2}\\ &\quad-\frac{2}{\beta}\langle(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]},\bm{v}^{[i+1]}-\bm{v}^{[i]}\rangle,\end{aligned} (45)

where we use $\bm{u}^{[i+1]}=\beta(\bm{v}^{[i]}-\bm{v}^{[i+1]})-\bm{w}^{[i]}\odot\bm{x}^{[i+1]}$ to obtain the fourth equality. Then we have

\begin{aligned} &\quad\,\,\|\bm{r}^{[i+1]}\|_{2}^{2}\\ &=\beta^{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}+\|(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]}\|_{2}^{2}\\ &\quad+2\beta\langle(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]},\bm{v}^{[i+1]}-\bm{v}^{[i]}\rangle\\ &\leq\beta^{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}+\|\bm{w}^{[i]}-\bm{w}^{[i+1]}\|_{2}^{2}\\ &\quad+2\beta\|(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{x}^{[i+1]}\|_{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}\\ &\leq\beta^{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}+\|\bm{w}^{[i]}-\bm{w}^{[i+1]}\|_{2}^{2}\\ &\quad+2\beta\|\bm{w}^{[i]}-\bm{w}^{[i+1]}\|_{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}\\ &\leq\beta^{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}+p^{4}\kappa^{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}\\ &\quad+2\beta p^{2}\kappa\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}\\ &=(\beta^{2}+2\beta\kappa p^{2}+\kappa^{2}p^{4})\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2},\end{aligned} (46)

where the last inequality holds because $f^{\prime}_{p}(\cdot)$ is $p^{2}$-Lipschitz continuous and $\kappa=\rho^{\max}$. This completes the proof.
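
Since the constant in (46) is a perfect square, $\beta^{2}+2\beta\kappa p^{2}+\kappa^{2}p^{4}=(\beta+\kappa p^{2})^{2}$, the bound just established can equivalently be written in the more compact form

\|\bm{r}^{[i+1]}\|_{2}\leq(\beta+\kappa p^{2})\,\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2},

so the residual is controlled linearly by the per-iteration displacement.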

Appendix E Construction of Fenchel-Rockafellar Dual (26)

The construction of the dual program relies on conjugate functions [47]. We first rewrite the objective $G(\bm{v};\bm{v}^{[i]})$ using conjugate functions as

\begin{aligned} G(\bm{v};\bm{v}^{[i]})&=\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}[\sup_{\bm{\lambda}_{nk}}\langle\bm{\lambda}_{nk},\bm{v}_{nk}\rangle-\delta_{\mathbb{B}}(\bm{\lambda}_{nk})]\\ &\quad+\frac{\beta}{2}\|\bm{v}-\bm{v}^{[i]}\|_{2}^{2}+\sup_{\bm{\mu}}[\langle\bm{\mu},\bm{v}\rangle-\delta_{\mathcal{C}}^{*}(\bm{\mu})],\end{aligned} (47)

where $\mathbb{B}:=\{\bm{y}\in\mathbb{C}^{L}\mid\|\bm{y}\|_{2}\leq 1\}$ denotes the unit ball of the dual norm of $\|\cdot\|_{2}$. Then the primal subproblem (21) can be written as

\begin{aligned} \inf_{\bm{v}}G(\bm{v};\bm{v}^{[i]})=\inf_{\bm{v}}\sup_{\bm{\lambda},\bm{\mu}}~&\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}[\langle\bm{\lambda}_{nk},\bm{v}_{nk}\rangle-\delta_{\mathbb{B}}(\bm{\lambda}_{nk})]\\ &+\frac{\beta}{2}\|\bm{v}-\bm{v}^{[i]}\|_{2}^{2}+\langle\bm{\mu},\bm{v}\rangle-\delta_{\mathcal{C}}^{*}(\bm{\mu}).\end{aligned} (48)

Swapping the order of $\inf$ and $\sup$ gives the corresponding Lagrangian dual, i.e.,

\begin{aligned} \sup_{\bm{\lambda},\bm{\mu}}\inf_{\bm{v}}~&\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}[\langle\bm{\lambda}_{nk},\bm{v}_{nk}\rangle-\delta_{\mathbb{B}}(\bm{\lambda}_{nk})]\\ &+\frac{\beta}{2}\|\bm{v}-\bm{v}^{[i]}\|_{2}^{2}+\langle\bm{\mu},\bm{v}\rangle-\delta_{\mathcal{C}}^{*}(\bm{\mu}).\end{aligned} (49)

The dual objective at $\bm{v}^{[i]}$ is then given by

\begin{aligned} Q(\bm{\lambda},\bm{\mu};\bm{v}^{[i]})=\inf_{\bm{v}}~&\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}[\langle\bm{\lambda}_{nk},\bm{v}_{nk}\rangle-\delta_{\mathbb{B}}(\bm{\lambda}_{nk})]\\ &+\frac{\beta}{2}\|\bm{v}-\bm{v}^{[i]}\|_{2}^{2}+\langle\bm{\mu},\bm{v}\rangle-\delta_{\mathcal{C}}^{*}(\bm{\mu}).\end{aligned} (50)

By exploiting the first-order necessary optimality condition of (50), we have

\begin{aligned} \sup_{\bm{\lambda},\bm{\mu}}~&-\frac{1}{2\beta}\|\bm{\lambda}\odot\bm{w}^{[i]}+\bm{\mu}-\beta\bm{v}^{[i]}\|_{2}^{2}+\frac{\beta}{2}\|\bm{v}^{[i]}\|_{2}^{2}-\delta_{\mathcal{C}}^{*}(\bm{\mu})\\ &-\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}\delta_{\mathbb{B}}(\bm{\lambda}_{nk}),\end{aligned} (51)

which is consistent with the Fenchel-Rockafellar dual presented in (26). This completes the construction.
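
As a sanity check of this construction, the sketch below verifies strong duality numerically for the subproblem under the simplifying assumption that the constraint set $\mathcal{C}$ is the whole space, so $\bm{\mu}$ can be taken as $\bm{0}$ and $\delta_{\mathcal{C}}^{*}$ vanishes. The groups, weights, and $\beta$ are arbitrary toy values, not drawn from the paper's setting.

import numpy as np

# Toy check of strong duality for (48)/(51) with C equal to the whole space.
rng = np.random.default_rng(1)
beta = 1.5
v_i = [rng.standard_normal(3) for _ in range(5)]        # groups of v^[i]
w = [abs(rng.standard_normal()) + 0.1 for _ in v_i]     # positive weights w^[i]

# Primal: closed-form block soft-thresholding solution of
#   min_v  sum_g w_g ||v_g||_2 + (beta/2) ||v - v^[i]||_2^2.
primal_val = 0.0
for wg, vg in zip(w, v_i):
    n = np.linalg.norm(vg)
    vg_opt = max(0.0, 1.0 - wg / (beta * n)) * vg if n > 0 else np.zeros_like(vg)
    primal_val += wg * np.linalg.norm(vg_opt) + 0.5 * beta * np.linalg.norm(vg_opt - vg) ** 2

# Dual: maximize (51) with mu = 0; the maximizer projects beta*v_g^[i]/w_g
# onto the unit ball of each group.
dual_val = 0.0
for wg, vg in zip(w, v_i):
    lam = beta * vg / wg
    if np.linalg.norm(lam) > 1.0:
        lam = lam / np.linalg.norm(lam)
    dual_val += (-0.5 / beta * np.linalg.norm(wg * lam - beta * vg) ** 2
                 + 0.5 * beta * np.linalg.norm(vg) ** 2)

print(primal_val, dual_val)        # the two values coincide (strong duality)
assert abs(primal_val - dual_val) < 1e-9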

Appendix F Proof of Theorem 3

We evaluate $g(\bm{v},\bm{\lambda},\bm{\mu};\bm{v}^{[i]})$ at $\bm{v}^{[i]}$ with $\bm{\lambda}^{[i+1]}\in\partial\|\bm{v}^{[i+1]}\|_{\mathcal{G}_{2}}$ and $\bm{\mu}^{[i+1]}\in\mathcal{N}_{\mathcal{C}}(\bm{v}^{[i+1]})$. Then, we have

\begin{aligned} &\quad\,\,g(\bm{v}^{[i]},\bm{\lambda}^{[i+1]},\bm{\mu}^{[i+1]};\bm{v}^{[i]})\\ &=G(\bm{v}^{[i]};\bm{v}^{[i]})-Q(\bm{\lambda}^{[i+1]},\bm{\mu}^{[i+1]};\bm{v}^{[i]})\\ &=\sum_{n=1}^{N}\sum_{k=1}^{K}w_{nk}^{[i]}\|\bm{v}_{nk}^{[i]}\|_{2}+\delta_{\mathcal{C}}(\bm{v}^{[i]})-\frac{\beta}{2}\|\bm{v}^{[i]}\|_{2}^{2}\\ &\quad+\frac{1}{2\beta}\|\bm{\lambda}^{[i+1]}\odot\bm{w}^{[i]}+\bm{\mu}^{[i+1]}-\beta\bm{v}^{[i]}\|_{2}^{2}+\delta_{\mathcal{C}}^{*}(\bm{\mu}^{[i+1]})\\ &\geq\frac{1}{2\beta}\|\bm{w}^{[i]}\odot\bm{\lambda}^{[i+1]}+\bm{\mu}^{[i+1]}\|_{2}^{2}+\langle\bm{w}^{[i]},\|\bm{v}^{[i]}\|_{\mathcal{G}_{2}}-\bm{v}^{[i]}\odot\bm{\lambda}^{[i+1]}\rangle\\ &\quad+\langle\bm{\mu}^{[i+1]},\bm{v}^{[i]}-\bm{v}^{[i]}\rangle\\ &\geq\frac{1}{2\beta}\|\bm{w}^{[i]}\odot\bm{\lambda}^{[i+1]}+\bm{\mu}^{[i+1]}\|_{2}^{2},\end{aligned} (52)

where the first inequality holds due to the Fenchel-Young inequality, and the second inequality holds since $\|\bm{\lambda}_{nk}^{[i+1]}\|_{2}\leq 1$ ensures $\|\bm{v}^{[i]}\|_{\mathcal{G}_{2}}-\bm{v}^{[i]}\odot\bm{\lambda}^{[i+1]}\geq\bm{0}$.

Next, by making use of (24) to replace $\bm{\mu}^{[i+1]}$ with $\beta(\bm{v}^{[i]}-\bm{v}^{[i+1]})-\bm{w}^{[i]}\odot\bm{\lambda}^{[i+1]}$ in (52), we have

\begin{aligned} &\quad\,\,\frac{1}{2\beta}\|\bm{w}^{[i]}\odot\bm{\lambda}^{[i+1]}+\bm{\mu}^{[i+1]}\|_{2}^{2}\\ &=\frac{1}{2\beta}\|\bm{w}^{[i+1]}\odot\bm{\lambda}^{[i+1]}+\bm{\mu}^{[i+1]}+(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{\lambda}^{[i+1]}\|_{2}^{2}\\ &=\frac{1}{2\beta}\big(\|\bm{r}^{[i+1]}\|_{2}^{2}+\|(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{\lambda}^{[i+1]}\|_{2}^{2}\\ &\quad+2\langle\bm{r}^{[i+1]},(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{\lambda}^{[i+1]}\rangle\big)\\ &=\frac{1}{2\beta}\|\bm{r}^{[i+1]}\|_{2}^{2}-\frac{1}{2\beta}\|(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{\lambda}^{[i+1]}\|_{2}^{2}\\ &\quad-\langle\bm{v}^{[i+1]}-\bm{v}^{[i]},(\bm{w}^{[i]}-\bm{w}^{[i+1]})\odot\bm{\lambda}^{[i+1]}\rangle\\ &\geq\frac{1}{2\beta}\|\bm{r}^{[i+1]}\|_{2}^{2}-\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}\|\bm{w}^{[i]}-\bm{w}^{[i+1]}\|_{2}\\ &\quad-\frac{1}{2\beta}\|\bm{w}^{[i]}-\bm{w}^{[i+1]}\|_{2}^{2},\end{aligned} (53)

where we exploit the Cauchy–Bunyakovskii–Schwarz (CBS) inequality to obtain the last inequality.

On the other hand, by strong duality, we have

g(\bm{v}^{[i]},\bm{\lambda}^{[i+1]},\bm{\mu}^{[i+1]};\bm{v}^{[i]})=G(\bm{v}^{[i]};\bm{v}^{[i]})-G(\bm{v}^{[i+1]};\bm{v}^{[i]}). (54)

Combining (52), (53), and (54), we have

\begin{aligned} &\quad\,\,\|\bm{r}^{[i+1]}\|_{2}^{2}\\ &\leq 2\beta\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})+\|\bm{w}^{[i]}-\bm{w}^{[i+1]}\|_{2}^{2}\\ &\quad+2\beta\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}\|\bm{w}^{[i]}-\bm{w}^{[i+1]}\|_{2}\\ &\leq 2\beta\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})+\kappa^{2}p^{4}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}\\ &\quad+2\beta\kappa p^{2}\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}\\ &\leq 2\beta\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})+(\kappa^{2}p^{4}+2\beta\kappa p^{2})\|\bm{v}^{[i]}-\bm{v}^{[i+1]}\|_{2}^{2}\\ &\leq 2\beta\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})+\frac{2}{\beta}(\kappa^{2}p^{4}+2\beta\kappa p^{2})\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})\\ &\leq\frac{2}{\beta}(\beta^{2}+2\beta\kappa p^{2}+\kappa^{2}p^{4})\Delta G(\bm{v}^{[i+1]};\bm{v}^{[i]})\\ &\leq\frac{2}{\beta}(\beta^{2}+2\beta\kappa p^{2}+\kappa^{2}p^{4})(J(\bm{v}^{[i]})-J(\bm{v}^{[i+1]})),\end{aligned} (55)

where the fourth inequality holds due to Lemma 2 $(iii)$.

Summing both sides of the above inequality from $i=0$ to $t$, we have

\begin{aligned} C(J(\bm{v}^{[0]})-J(\bm{v}^{[t+1]}))&\geq\sum_{i=0}^{t}\|\bm{r}^{[i+1]}\|_{2}^{2}\\ &\geq t\min_{i=0,\cdots,t}\|\bm{r}^{[i+1]}\|_{2}^{2}\end{aligned} (56)

with $C=\frac{2(\beta^{2}+2\beta\kappa p^{2}+\kappa^{2}p^{4})}{\beta}$. This indicates that

\begin{aligned} \min_{i=0,\cdots,t}\|\bm{r}^{[i+1]}\|_{2}^{2}&\leq\frac{2(\beta^{2}+2\beta\kappa p^{2}+\kappa^{2}p^{4})}{t\beta}(J(\bm{v}^{[0]})-J(\bm{v}^{[t+1]}))\\ &\leq\frac{2(\beta^{2}+2\beta\kappa p^{2}+\kappa^{2}p^{4})}{t\beta}(J(\bm{v}^{[0]})-J(\bm{v}^{*})).\end{aligned} (57)

Hence

\min_{i=0,\cdots,t}\|\bm{r}^{[i+1]}\|_{2}^{2}=O\Big(\frac{1}{t}\Big). (58)

This completes the proof.
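
As a rough numerical reading of the bound (57), the snippet below evaluates the constant $C$ and the number of iterations sufficient to drive the minimal squared residual below a target tolerance; $\beta$, $\kappa$, $p$, and the initial objective gap are placeholder values only, not taken from the simulations.

# Rough illustration of the worst-case O(1/t) bound in (57): compute the
# constant C and the iteration count t that guarantees
#   min_i ||r^[i+1]||_2^2 <= eps.
# beta, kappa, p, and the initial objective gap are placeholder values only.
beta, kappa, p = 2.0, 1.0, 0.5
gap0 = 10.0        # assumed J(v^[0]) - J(v^*)
eps = 1e-3         # target squared residual

C = 2.0 * (beta**2 + 2.0 * beta * kappa * p**2 + kappa**2 * p**4) / beta
t_needed = C * gap0 / eps          # from (57): C * gap0 / t <= eps
print(f"C = {C:.3f}, iterations sufficient for eps = {eps}: t >= {t_needed:.0f}")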

References

  • [1] S. Hua, X. Yang, K. Yang, G. Yin, Y. Shi, and H. Wang, “Deep learning tasks processing in fog-RAN,” in Proc. IEEE Veh. Technol. Conf. (VTC), pp. 1–5, IEEE, 2019.
  • [2] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [3] A. Voulodimos, N. Doulamis, A. Doulamis, and E. Protopapadakis, “Deep learning for computer vision: A brief review,” Comput. Intell. Neurosci., vol. 2018, pp. 1–13, 2018.
  • [4] C. V. N. Index, “Global mobile data traffic forecast update, 2017–2022,” White paper, Feb. 2019.
  • [5] Q. Zhang, L. Cheng, and R. Boutaba, “Cloud computing: state-of-the-art and research challenges,” J. Internet Services Appl., vol. 1, no. 1, pp. 7–18, 2010.
  • [6] Z. Yan and C. Chen, “Cloud-based video communication and networking: an architectural overview,” J. Commun. Information Netw., vol. 1, no. 1, pp. 27–43, 2016.
  • [7] J. Chen and X. Ran, “Deep learning with edge computing: A review,” Proc. IEEE, vol. 107, no. 8, pp. 1655–1674, 2019.
  • [8] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey on mobile edge computing: The communication perspective,” IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2322–2358, 2017.
  • [9] K. B. Letaief, W. Chen, Y. Shi, J. Zhang, and Y.-J. A. Zhang, “The roadmap to 6G: AI empowered wireless networks,” IEEE Commun. Mag., vol. 57, no. 8, pp. 84–90, 2019.
  • [10] Z. Zhang, Y. Xiao, Z. Ma, M. Xiao, Z. Ding, X. Lei, G. K. Karagiannidis, and P. Fan, “6G wireless networks: Vision, requirements, architecture, and key technologies,” IEEE Veh. Technol. Mag., vol. 14, no. 3, pp. 28–41, 2019.
  • [11] J. Zhang and K. B. Letaief, “Mobile edge intelligence and computing for the Internet of vehicles,” Proc. IEEE, vol. 108, no. 2, pp. 246–261, 2020.
  • [12] Y. Huang, C. Xu, C. Zhang, M. Hua, and Z. Zhang, “An overview of intelligent wireless communications using deep reinforcement learning,” J. Commun. Information Netw., vol. 4, no. 2, pp. 15–29, 2019.
  • [13] K. Yang, T. Jiang, Y. Shi, and Z. Ding, “Federated learning via over-the-air computation,” IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022–2035, 2020.
  • [14] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, “Edge intelligence: Paving the last mile of artificial intelligence with edge computing,” Proc. IEEE, vol. 107, no. 8, pp. 1738–1762, 2019.
  • [15] T. Jiang, X. Yang, Y. Shi, and H. Wang, “Layer-wise deep neural network pruning via iteratively reweighted optimization,” in Proc. IEEE Int. Conf. Acoustics Speech Signal Process. (ICASSP), pp. 5606–5610, IEEE, 2019.
  • [16] A. Ren, T. Zhang, S. Ye, J. Li, W. Xu, X. Qian, X. Lin, and Y. Wang, “ADMM-NN: An algorithm-hardware co-design framework of DNNs using alternating direction methods of multipliers,” in Proc. ASPLOS., pp. 925–938, 2019.
  • [17] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proc. IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
  • [18] G. Plastiras, M. Terzi, C. Kyrkou, and T. Theocharides, “Edge intelligence: Challenges and opportunities of near-sensor machine learning applications,” in 2018 IEEE 29th Int. Conf. on Appl.-specific Syst. Archit. and Processors (ASAP), pp. 1–7, IEEE, 2018.
  • [19] K. Yang, Y. Shi, and Z. Ding, “Data shuffling in wireless distributed computing via low-rank optimization,” IEEE Trans. Signal Process., vol. 67, no. 12, pp. 3087–3099, 2019.
  • [20] E. Li, L. Zeng, Z. Zhou, and X. Chen, “Edge AI: On-demand accelerating deep neural network inference via edge computing,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 447–457, 2020.
  • [21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vision Pattern Recognition (CVPR), pp. 770–778, 2016.
  • [22] S. Hua, Y. Zhou, K. Yang, and Y. Shi, “Reconfigurable intelligent surface for green edge inference,” arXiv preprint arXiv:1912.00820, 2019.
  • [23] K. Yang, Y. Shi, W. Yu, and Z. Ding, “Energy-efficient processing and robust wireless cooperative transmission for edge inference,” arXiv preprint arXiv:1907.12475, 2019.
  • [24] D. Gesbert, S. Hanly, H. Huang, S. S. Shitz, O. Simeone, and W. Yu, “Multi-cell MIMO cooperative networks: A new look at interference,” IEEE J. Sel. Areas Commun., vol. 28, no. 9, pp. 1380–1408, 2010.
  • [25] K. Li, M. Tao, and Z. Chen, “Exploiting computation replication for mobile edge computing: A fundamental computation-communication tradeoff study,” arXiv preprint arXiv:1903.10837, 2019.
  • [26] A. Y. Kruger, “On fréchet subdifferentials,” J. Math. Sci., vol. 116, no. 3, pp. 3325–3358, 2003.
  • [27] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, “Optimization with sparsity-inducing penalties,” Found. Trends in Mach. Learn., vol. 4, no. 1, pp. 1–106, 2012.
  • [28] R. Jenatton, J.-Y. Audibert, and F. Bach, “Structured variable selection with sparsity-inducing norms,” J. Mach. Learn. Research, vol. 12, no. Oct, pp. 2777–2824, 2011.
  • [29] Y. Shi, J. Zhang, and K. B. Letaief, “Group sparse beamforming for green cloud-RAN,” IEEE Trans. Wireless Commun., vol. 13, no. 5, pp. 2809–2823, 2014.
  • [30] B. Dai and W. Yu, “Energy efficiency of downlink transmission strategies for cloud radio access networks,” IEEE J. Sel. Areas Commun., vol. 34, no. 4, pp. 1037–1050, 2016.
  • [31] L. Liu, E. G. Larsson, W. Yu, P. Popovski, C. Stefanovic, and E. De Carvalho, “Sparse signal processing for grant-free massive connectivity: A future paradigm for random access protocols in the Internet of things,” IEEE Signal Process. Mag., vol. 35, no. 5, pp. 88–99, 2018.
  • [32] Y. Shi, J. Dong, and J. Zhang, Low-overhead Communications in IoT Networks–Structured Signal Processing Approaches. Springer Singapore, 2020.
  • [33] O. Mehanna, N. D. Sidiropoulos, and G. B. Giannakis, “Joint multicast beamforming and antenna selection,” IEEE Trans. Signal Process., vol. 61, no. 10, pp. 2660–2674, 2013.
  • [34] E. J. Candes, M. B. Wakin, and S. P. Boyd, “Enhancing sparsity by reweighted $\ell_{1}$ minimization,” J. Fourier Anal. Appl., vol. 14, no. 5-6, pp. 877–905, 2008.
  • [35] Y. Shi, J. Cheng, J. Zhang, B. Bai, W. Chen, and K. B. Letaief, “Smoothed $\ell_{p}$-minimization for green cloud-RAN with user admission control,” IEEE J. Sel. Areas Commun., vol. 34, no. 4, pp. 1022–1036, 2016.
  • [36] P. Ochs, A. Dosovitskiy, T. Brox, and T. Pock, “On iteratively reweighted algorithms for nonsmooth nonconvex optimization in computer vision,” SIAM J. Imag. Sci., vol. 8, no. 1, pp. 331–372, 2015.
  • [37] C. Lu, Y. Wei, Z. Lin, and S. Yan, “Proximal iteratively reweighted algorithm with multiple splitting for nonconvex sparsity optimization,” in Proc. 28th AAAI Conf. Artif. Intell., pp. 1251–1257, 2014.
  • [38] T. Sun, H. Jiang, and L. Cheng, “Global convergence of proximal iteratively reweighted algorithm,” J. Global Optimiz., vol. 68, no. 4, pp. 815–826, 2017.
  • [39] P.-A. Absil, R. Mahony, and B. Andrews, “Convergence of the iterates of descent methods for analytic cost functions,” SIAM J. Optimiz., vol. 16, no. 2, pp. 531–547, 2005.
  • [40] T.-J. Yang, Y.-H. Chen, J. Emer, and V. Sze, “A method to estimate the energy consumption of deep neural networks,” in Proc. 51st Asilomar Conf. Signals, Syst. Comput., pp. 1916–1920, IEEE, 2017.
  • [41] T.-J. Yang, Y.-H. Chen, and V. Sze, “Designing energy-efficient convolutional neural networks using energy-aware pruning,” in Proc. IEEE Conf. Comput. Vision Pattern Recognition (CVPR), pp. 5687–5695, 2017.
  • [42] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vision Pattern Recognition (CVPR), pp. 1–9, 2015.
  • [43] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2016.
  • [44] “CNN Energy Estimation Website.” http://energyestimation.mit.edu.
  • [45] Y. Shen, Y. Shi, J. Zhang, and K. B. Letaief, “LORM: Learning to optimize for resource management in wireless networks with few training samples,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 665–679, 2020.
  • [46] A. Wiesel, Y. C. Eldar, and S. Shamai, “Linear precoding via conic optimization for fixed MIMO receivers,” IEEE Trans. Signal Process., vol. 54, no. 1, pp. 161–176, 2005.
  • [47] R. T. Rockafellar, Convex Analysis, vol. 36. Princeton Univ. Press, 2015.
  • [48] H. Dahrouj and W. Yu, “Coordinated beamforming for the multicell multi-antenna wireless system,” IEEE Trans. Wireless Commun., vol. 9, no. 5, pp. 1748–1759, 2010.