Restructuring, Pruning, and Adjustment of Deep Models for Parallel Distributed Inference

Afshin Abdi, Saeed Rashidi, Faramarz Fekri, Tushar Krishna
School of Electrical and Computer Engineering
Georgia Institute of Technology, Atlanta, GA
{abdi,saeed.rashidi,fekri,tushar}@gatech.edu

Abstract

Using multiple nodes and parallel computing algorithms has become a principal tool to improve training and execution times of deep neural networks as well as effective collective intelligence in sensor networks. In this paper, we consider the parallel implementation of an already-trained deep model on multiple processing nodes (a.k.a. workers) where the deep model is divided into several parallel sub-models, each of which is executed by a worker. Since latency due to synchronization and data transfer among workers negatively impacts the performance of the parallel implementation, it is desirable to have minimum interdependency among parallel sub-models. To achieve this goal, we propose to rearrange the neurons in the neural network and partition them (without changing the general topology of the neural network), such that the interdependency among sub-models is minimized under the computations and communications constraints of the workers. We propose RePurpose, a layer-wise model restructuring and pruning technique that guarantees the performance of the overall parallelized model. To efficiently apply RePurpose, we propose an approach based on $\ell_{0}$ optimization and the Munkres assignment algorithm. We show that, compared to the existing methods, RePurpose significantly improves the efficiency of the distributed inference via parallel implementation, both in terms of communication and computational complexity.

1 Introduction

In recent years, the size and complexity of deep neural networks has been increased significantly in terms of model’s structure and number of parameters. Consequently, real-time implementation and inference in many machine learning (ML) problems has become a challenging task. Although the execution time of deep neural networks can be improved significantly by the application of parallel computing algorithms and using multiple processing units (such as GPU’s or clusters of computing nodes), it generally requires synchronization and data exchange among processing units to some extent. This is mainly due to the fact that in parallel computations, each processing unit performs a portion of the computations, its inputs generally depend on the outputs from other units, and the results of computations should be aggregated to yield the desired output. These co-dependencies can lead to significant delays in computations. Moreover, in some real-world scenarios, such as sensor networks, the inference is done on the data observed by the entire network, i.e., each node in the network only observes part of the data. However, transferring all data to a central powerful node to aggregate and perform the ML task is undesirable due to the sheer amount of data to be collected, limited computational power, privacy concerns, or even availability of such a node. Hence, it is more favorable to develop a distributed equivalence of a deep model for deploying over the processors/sensor network.

In the aforementioned applications, straightforward parallel computing algorithms cannot be arbitrarily scaled up for deep models with complex connectivity structures. The majority of past works on distributed/parallel execution of deep neural networks are concerned with algorithmic aspects of the parallel implementation of the neural network (e.g., Zinkevich et al. (2010); Chung et al. (2014); De Grazia et al. (2012)). However, here, we focus on the structure of deep models and how we can modify it for efficient parallel distributed implementation.

In recent years, there has been an increasing interest in compressing, pruning, or modifying the structure of deep models to reduce their computational or storage costs, while keeping the accuracy or performance of the modified model acceptable. The majority of these approaches can be classified into three categories:

•

Knowledge Distillation to train a shallow or small model (referred to as student network) that mimics the behavior of an already trained complex model (a.k.a. teacher network) or an ensemble of teacher networks (see e.g., Hinton et al. (2015); Romero et al. (2015); Zagoruyko and Komodakis (2017)).
•

Using Structured Parameters to reduce the size of deep model or its processing time. Examples include using circulant matrices Cheng et al. (2015) or Adaptive Fastfood transform Yang et al. (2015) for fully connected layers, and separable filters Rigamonti et al. (2013) or low-rank tensor decomposition Tai et al. (2016) for convolutional layers.
•

Pruning Parameters has been used extensively to reduce the complexity of the model as well as over-parametrization. $\ell_{1}$ or $\ell_{0}$ regularization Louizos et al. (2018), and group-sparsity Zhou et al. (2016); Wen et al. (2016) have been successfully used to promote sparsity of the parameters during training. Model pruning algorithms such as Optimal Brain Damage Cun et al. (1990), Optimal Brain Surgeon Hassibi et al. (1993), hard-thresholding the parameters Han et al. (2015), and similar works Castellano et al. (1997); Leung et al. (2001), mainly focus on removing the insignificant edges or neurons, by considering the magnitude of the weights or their approximate Hessian matrix as a measure of importance. More recently, Aghasi, et al. Aghasi et al. (2017, 2020) proposed Net-Trim, a convex optimization technique to prune the parameters of the deep model by analyzing the signals in the neural network.

Although it is possible to design deep models according to the capability and constraints of the processing system, following such an approach requires training a new deep model for every target hardware which is infeasible or demanding in many ML problems. Further, imposing a possibly unnecessary structure in advance during training a deep model would likely be limiting in terms of model performance and accuracy. It will be also an undesirable approach for parallel implementation since a model specifically designed for optimum implementation on a target platform or architecture may be far from optimum on other platforms (e.g., GPUs with different compute capabilities, or CPU vs GPU vs sensor network). Hence optimizing and fixing the structure for one particular parallel distributed setting in advance would limit the optimal deployment on other platforms. As a result, we assume that a complex deep model has already been trained with minimum or no hardware-specific constraints on its parameters or structure. Our goal would be readjusting the model via restructuring the layers and manipulating the parameters of the neural network without changing its general topology for more efficient parallel implementation.

\begin{overpic}[width=433.62pt]{Figures/ModelRestructuring_Example.pdf} \put(4.0,-5.0){\scalebox{0.9}{(a) Original model}} \put(52.0,-5.0){\scalebox{0.9}{(b) Restructured model}} \end{overpic}

Figure 1: Restructuring a neural network to reduce communication between processing units

As an example, consider the simple neural network in Fig. 1(a). Simply partitioning the model into two sub-models (as depicted by a dashed line in the Fig. 1(a)) imposes lots of communication between the two partitions. However, by rearranging the neurons properly, the co-dependency (and hence required communications) between the two sub-models (the red edges in Fig. 1(b)) is reduced substantially. It is worth mentioning that there are approximately $\mathcal{O}(P^{N})$ different partitioning to distribute computations of a neural network’s layer with $N$ neurons over $P$ workers. Hence, enumerating all such possibilities and choosing a good one is infeasible specially for large networks. In this paper, we propose a systematic approach to perform such partitioning and parameter adjustment to ensure efficient implementation of the modified model while keeping its accuracy close to the original model.

Notations

$P$	Number of workers
$N$	Number of neurons
$L$	Number of layers
$\bm{W}$	Weight matrix
$\bm{b}$	Bias
$\bm{M}$	Mask matrix

Bold lowercase letters represent vectors and the $i$ -th element of the vector $\bm{x}$ is denoted as $x_{i}$ . Matrices are denoted by bold capital letters such as $\bm{X}$ , with the $(i,j)$ -th element represented by $X_{i,j}$ or $[\bm{X}]_{i,j}$ . $\bm{A}\odot\bm{B}$ is the Hadamard (element-wise) product of $\bm{A}$ and $\bm{B}$ . $\|\bm{X}\|_{F}$ is the Frobenius norm of $\bm{X}$ , $\|\bm{x}\|_{2}$ and $\|\bm{x}\|_{0}$ are the $\ell_{2}$ and $\ell_{0}$ norms of $\bm{x}$ , respectively. $\bm{1}$ is a vector or matrix of all ones, whose size would be clear from the context.

2 Problem Statement and our Approach

Consider the problem of parallel distributed implementation of a trained deep neural network over $P$ processing units (hereafter referred to as workers), where the deep model is divided into $P$ sub-models, each of which is executed by a worker. As managing the synchronization and data transfer among workers degrades the efficiency of the parallel implementation (e.g., higher latency), it is crucial to reduce the communication among workers. The communication is needed between the workers when the input of a neuron in a sub-model is from the output of a neuron belonging to a different sub-model which resides in another worker. These co-dependencies can lead to significant delays in computation.

For the sake of simplicity in presentations and analysis, here, we mainly focus on feedforward deep models, specifically fully-connected layers. Note that the convolution layer can be represented as a special case of a fully connected layer. ¹¹1Recall that the convolution $\bm{h}*\bm{x}$ can be represented as $\bm{W}^{\mathsf{T}}\bm{x}$ for a circulant matrix $\bm{W}$ constructed from $\bm{h}$ . For more details and the extensions of our approach to other complex architectures, please refer to the supplementary document.

Consider an arbitrary neural network with $L$ layers and parameters $\{\bm{\theta}^{(l)}\}_{l=1}^{L}$ , where $\bm{\theta}^{(l)}=\{\bm{W}^{(l)},\bm{b}^{(l)}\}$ is the parameters of the $l$ -th layer. Let $\bm{x}^{(l)}$ be the input signal to the $l$ -th layer. Then, the output of the layer (input to the next layer) would be given by

\bm{y}^{(l)}=(\bm{W}^{(l)})^{\mathsf{T}}\bm{x}^{(l)}+\bm{b}^{(l)},\quad\quad\bm{x}^{(l+1)}=\sigma(\bm{y}^{(l)}),

(1)

where $\sigma(\cdot)$ is the activation function.

\begin{overpic}[width=390.25534pt]{Figures/problem_formulation.pdf} \put(0.0,30.0){\scalebox{0.9}{$\bm{x}_{1}$}} \put(0.0,13.0){\scalebox{0.9}{$\bm{x}_{2}$}} \put(32.0,33.0){\scalebox{0.9}{$\bm{y}_{1}$}} \put(32.0,10.0){\scalebox{0.9}{$\bm{y}_{2}$}} \put(45.0,10.0){\scalebox{0.8}{${\color[rgb]{0.64,0.48,0}\bm{y}_{1}}={\color[rgb]{0.64,0.48,0}\bm{W}_{11}^{\mathsf{T}}\bm{x}_{1}+\bm{b}_{1}}+{\color[rgb]{0.75,0,0}\bm{W}_{12}^{\mathsf{T}}\bm{x}_{2}}$}} \put(45.0,2.0){\scalebox{0.8}{${\color[rgb]{0,0.5,0.22}\bm{y}_{2}}={\color[rgb]{0,0.5,0.22}\bm{W}_{22}^{\mathsf{T}}\bm{x}_{2}+\bm{b}_{2}}+{\color[rgb]{0.75,0,0}\bm{W}_{21}^{\mathsf{T}}\bm{x}_{1}}$}} \end{overpic}

Figure 2: Communication between workers in parallel execution of a model over two workers. The intra-worker computations are denoted by yellow and green connections, while required communication between the workers are denoted by red edges. The binary mask matrix (right image) can be used to determine the edges between the two workers.

To analyze the bottlenecks, consider an arbitrary layer with input $\bm{x}$ , and parameters $\bm{W}$ and $\bm{b}$ (Fig. 2). Hence, $\bm{y}=\bm{W}^{\mathsf{T}}\bm{x}+\bm{b}$ would be the input signal to the neurons of the layer. Suppose that $\bm{x}_{k}$ and $\bm{y}_{k}$ are subsets of the signals that are processed by the $k$ -th worker. Without loss of generality, we assume that the neurons are ordered such that the $k$ -th block of consecutive neurons belongs to the $k$ -th sub-model, i.e., $\bm{x}=[\bm{x}_{1};\bm{x}_{2};\ldots;\bm{x}_{P}]$ . By partitioning $\bm{W}$ and $\bm{b}$ accordingly, we observe that

\bm{y}_{k}=(\bm{W}_{k,k}^{\mathsf{T}}\bm{x}_{k}+\bm{b}_{k})+{\color[rgb]{0.75,0,0}(\sum_{l\neq k}\bm{W}_{k,l}^{\mathsf{T}}\bm{x}_{l})}.

(2)

Note that the first term can be computed at the $k$ -th worker independent of the others, whereas computing the second term requires synchronization and communication from the other workers. Hence, to reduce the dependency among workers and the communication cost, we consider minimizing the number of non-zero elements in $\bm{W}_{k,l}$ , for $l\neq k$ .

By defining an appropriate binary mask $\bm{M}$ (Fig. 2 (right)), the connections between sub-models can be determined by the non-zero elements of $\bm{M}\odot\bm{W}$ . In general, if $\iota_{k}$ and $o_{k}$ are the number of input and output neurons assigned to the $k$ -th worker, then $\bm{M}$ is an anti-diagonal block matrix, given by

\bm{M}=1-\operatorname{diag}\left(\bm{1}_{\iota_{{}_{1}}\times o_{{}_{1}}},\ldots,\bm{1}_{\iota_{{}_{P}}\times o_{{}_{P}}}\right).

Remark 1.

Note that the bias $\bm{b}$ does not contribute to the communication between workers and can be safely ignored in computing the cost. Further, $\|\bm{M}\odot\bm{W}\|_{0}$ can be viewed as the number of edges between sub-models, and be used as an approximation to the latency caused by the communication and synchronization among workers. Similarly, by defining an appropriate binary mask $\bm{M}_{ij}$ , we can find the edges from worker $j$ to $i$ from the non-zero entries of $\bm{V}_{ij}:=\bm{M}_{ij}\odot\bm{W}$ . Depending on the communication protocol among workers, the number of non-zero edges, number of non-zero rows, or number of non-zero columns of $\bm{V}_{ij}$ can be interpreted as a measure of latency due to the communication from worker $j$ to $i$ . For the sake of simplicity, in this work, we consider $\|\bm{M}\odot\bm{W}\|_{0}$ as a measure of total communication latency. However, the extensions of our proposed approach to other cases is straightforward.

To reduce the communication, one may attempt to reduce the number of cross-edges among sub-models. However, as we observed in our experiments, generally there are many important connections between neurons from different sub-models, and removing those connections can severely affect the performance of the neural network. Hence, it is important to have such neurons in the same sub-model. On the other hand, the problem of neuron assignment to the workers is combinatorial and discrete with complexity $\mathcal{O}(P^{N})$ for a layer with $N$ neurons and $P$ workers. Hence, enumerating all possibilities or using ordinary optimization techniques as well as genetic algorithms or simulated annealing would fail due to the complex nature of interactions among neurons in a deep neural network. Based on the above observations, we devise RePurpose, a layer-wise neural network restructuring and pruning for efficient parallel implementation. The gist of the idea is as follows;

The neurons of the input layer are assigned to the sub-models based on each worker’s computational power and/or structure of the input data. For example, in a sensor network, it is dictated by the input of each sensor. Next, we restructure and adjust the neural network, sequentially one layer at a time. For the $l$ -th layer, the assignments of the neurons in layer $l-1$ are assumed to be fixed and known from the previous steps. The neurons in layer $l$ are rearranged and assigned to each sub-model, and the parameters of the layer are pruned and fine-tuned, such that ( $i$ ) the performance of the modified neural network is close to the original one, and ( $ii$ ) the communication between the sub-models (measured by the number of edges connecting neurons from different sub-models) is minimized.

3 RePurpose: Restructuring and Pruning Deep Models

Refer to caption — Figure 3: Rearranging neurons of a layer and adjusting parameters such that the $k$ -th block of signals, $\widehat{\bm{y}}_{k}$ , is processed at the $k$ -th worker.

Consider the $l$ -th layer of neural network and assume that the neurons in the previous layers have already been partitioned and rearranged, i.e., the input of the layer is partitioned as $[\bm{x}_{1};\ldots;\bm{x}_{P}]$ , where $\bm{x}_{k}$ is computed at the $k$ -th worker. Let $\bm{y}$ and $\bm{W}$ be the signals and parameters of the $l$ -th layer in the original model. RePurpose rearranges the neurons such that the $k$ -th block of neurons are being assigned to the $k$ -th worker (Fig. 3). Note that the rearrangement of the neurons can be captured via a permutation matrix $\bm{\Pi}$ . Hence, if we use the same weights, the effect of neuron-rearrangement can be formulated as $\widehat{\bm{y}}=\bm{\Pi y}$ and $\widehat{\bm{W}}=\bm{W\Pi}^{\mathsf{T}}$ , and the number of cross-edges between workers would be $\|\bm{M}\odot\widehat{\bm{W}}\|_{0}$ . To further reduce the communication between workers, RePurpose not only rearranges the neurons, but it also prunes and adjusts $\widehat{\bm{W}}$ . Hence, the optimization problem for RePurpose is formulated as

\min_{\widehat{\bm{W}},\bm{\Pi}}\|\bm{M}\odot\widehat{\bm{W}}\|_{0}\quad\quad\operatorname*{s.t.}~{}\|\widehat{\bm{W}}-\bm{W\Pi}^{\mathsf{T}}\|_{F}^{2}\leq\epsilon,

(3)

where $\epsilon$ is a parameter controlling the closeness of the parameters. Directly solving (3) is infeasible as it is (mixed-)discrete, non-convex, and there are $N!$ different permutation matrices. In the following, we propose an alternative and efficient approach to solve (3).

Recall that if neuron $i$ is assigned to worker $j$ , the signal at that neuron can be rewritten as $\hat{y}_{i}=b_{i}+\widehat{\bm{w}}_{i}^{\mathsf{T}}\bm{x}=b_{i}+\widehat{\bm{w}}_{ij}^{\mathsf{T}}\bm{x}_{j}+\sum_{k\neq j}\widehat{\bm{w}}_{ik}^{\mathsf{T}}\bm{x}_{k}$ , where $\widehat{\bm{w}}_{i}$ is the $i$ -th column of $\widehat{\bm{W}}$ , and $\widehat{\bm{w}}_{ik}$ is the $k$ -th block of $\widehat{\bm{w}}_{i}$ corresponding to $\bm{x}_{k}$ . Hence, the communication cost from other workers to worker $j$ would be $\|\widehat{\bm{w}}_{i,\backslash j}\|_{0}:=\sum_{k\neq j}\|\widehat{\bm{w}}_{ik}\|_{0}$ . By incorporating an additional optional cost to encourage the total sparsity of the parameters, $\|\widehat{\bm{w}}_{i}\|_{0}$ , the cost of assigning neuron $i$ to worker $j$ would be

c_{ji}=\min_{\widehat{\bm{w}}_{i}}\|\bm{w}_{i}-\widehat{\bm{w}}_{i}\|_{2}^{2}+\eta_{1}\|\widehat{\bm{w}}_{i}\|_{0}+\eta_{2}\|\widehat{\bm{w}}_{i,\backslash j}\|_{0},

(4)

where $\eta_{1}$ and $\eta_{2}$ control the trade-off between the error, sparsity, and cross-communication.

Lemma 1.

The solution of (4) is given by element-wise hard-thresholding $\bm{w}_{i}$ , i.e.,

[\widehat{\bm{w}}_{i}]_{n}=\left\{\begin{array}[]{ll}0&|\left[\bm{w}_{i}\right]_{n}|\leq\sqrt{\eta}\\ \left[\bm{w}_{i}\right]_{n}&\textup{o.w.}\end{array}\right.

(5)

where $\eta=\eta_{1}$ or $\eta_{1}+\eta_{2}$ , depending on whether neuron $n$ from the previous layer has been assigned to the $j$ -th worker or not.

Function RePurpose( $\bm{W}$ , $\{n_{k}\}_{k=1}^{P}$ , $\eta_{1}$ , $\eta_{2}$ ):

- Compute the cost matrix

\bm{C}

, where

[\bm{C}]_{j,i}

is calculated via (4) and (5)

- Construct

\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.59721pt}{\hskip 1.29167pt$\widetildesym$}}}}{\bm{C}}}{\accentset{\textstyle\text{\smash{\raisebox{-5.59721pt}{\hskip 1.29167pt$\widetildesym$}}}}{\bm{C}}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.91806pt}{\hskip 0.90417pt$\widetildesym$}}}}{\bm{C}}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.7986pt}{\hskip 0.64583pt$\widetildesym$}}}}{\bm{C}}}

by repeating the

k

-th row of

\bm{C}

n_{k}

times.

(I,J)=\textsc{Munkres}(\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.59721pt}{\hskip 1.29167pt$\widetildesym$}}}}{\bm{C}}}{\accentset{\textstyle\text{\smash{\raisebox{-5.59721pt}{\hskip 1.29167pt$\widetildesym$}}}}{\bm{C}}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.91806pt}{\hskip 0.90417pt$\widetildesym$}}}}{\bm{C}}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.7986pt}{\hskip 0.64583pt$\widetildesym$}}}}{\bm{C}}})

- Define permutation matrix as

\bm{\Pi}_{I,J}=1

Return

\bm{\Pi}

Algorithm 1 Parameter-Space RePurpose

Restructuring and neuron assignment can be interpreted as selecting elements from the cost matrix $\bm{C}$ , whose $(j,i)$ -th element is given by (4), such that ( $i$ ) from row $k$ , $n_{k}$ elements are selected, i.e., $n_{k}$ neurons are assigned to worker $k$ , ( $ii$ ) from each column, only one element is selected, i.e., each neuron can be assigned to only one worker, and ( $iii$ ) the sum of selected elements is minimized, i.e., the total cost of neuron assignment and parameter adjustment is minimum.

Algorithm 1 summarizes the proposed solution, where Munkres( $\cdot$ ) uses the Munkres assignment algorithm Kuhn (1955); Munkres (1957) to find the (row-column) index pairs that minimizes the total sum cost $\sum_{n}[\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.59721pt}{\hskip 1.29167pt$\widetildesym$}}}}{\bm{C}}}{\accentset{\textstyle\text{\smash{\raisebox{-5.59721pt}{\hskip 1.29167pt$\widetildesym$}}}}{\bm{C}}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.91806pt}{\hskip 0.90417pt$\widetildesym$}}}}{\bm{C}}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.7986pt}{\hskip 0.64583pt$\widetildesym$}}}}{\bm{C}}}]_{{}_{I_{n},J_{n}}}$ .

Theorem 2.

Algorithm 1 finds the optimum solution of

\|\widehat{\bm{W}}-\bm{W\Pi}^{\mathsf{T}}\|_{F}^{2}+\eta_{1}\|\widehat{\bm{W}}\|_{0}+\eta_{2}\|\bm{M}\odot\widehat{\bm{W}}\|_{0},

(6)

with time complexity $\mathcal{O}(N^{3})$ , where $N$ is the number of layer’s neurons (number of columns of $\bm{W}$ ).

Note that by setting $\eta_{1}=0$ , (6) would be the Lagrangian of (3) and choosing appropriate value for $\eta_{2}$ can lead to the desired error bound $\|\widehat{\bm{W}}-\bm{W\Pi}^{\mathsf{T}}\|_{F}^{2}\leq\epsilon$ . Finally, it is worth mentioning that the bias term does not contribute to the communication cost and is given by $\widehat{\bm{b}}=\bm{\Pi b}$ .

Remark 2.

In model pruning and compression, it is common to retrain the modified model to fine-tune the parameters and improve the accuracy of the model. This extra post-processing is generally referred to as post-training phase or fine-tuning. The same principle can be applied to our proposed algorithm.

4 Experiments

To evaluate the performance of the RePurpose algorithm, we consider different neural network architectures and compare the accuracy, communication and wall-clock times w.r.t. naive implementation where the input data is communicated to all nodes in the network, so they all have the entire input data, baseline which is parallel implementation of the deep model without any modification to the parameters or structures, and sparse implementation which sparsifies the parameters to reduce cross-edges between the workers without re-arranging the neurons. We evaluate the accuracy-communication trade-off in different sensor networks, as well as the reduction in total computation time (wall-clock time) in Edge and Data Center platforms.

4.1 Sensor Network

Setup 1. Figure 6(a) shows a 2 sensors network, sensor $i$ observes location $x_{i}$ of a target object and each sensor’s task is to determine whether the object is in the blue or green region. A simple neural network (Fig. 6(b)) is trained at a central node to perform the task with accuracy $94.5\%$ . In the naive approach, the sensors exchange their observations ( $x_{i}$ ’s) and run the inference (NN) independently. Hence, the NN is executed twice throughout the network at the cost of higher computational complexity. Alternatively, we can apply RePurpose to efficiently distribute the inference over the sensors. We applied RePurpose with $\eta_{1}=0$ , $\eta_{2}=0.01$ (Fig. 6(c)), and $\eta_{2}=0.1$ (Fig. 6(d)). As a result, the number of cross-worker communications reduced significantly to $1.7\%$ , $1.5\%$ and $1.6\%$ for $\eta_{2}=0.01$ , and $0.7\%$ , $0.1\%$ and $0.3\%$ for $\eta_{2}=0.1$ for layers 1, 2, and 3, respectively. Specifically, with only $6$ communicated values, the computational complexity at each sensor is reduced by almost a factor of $4$ compared to the naive implementation. However, the accuracy of the distributed parallel model, prior to the post-training phase, is reduced to $93.5\%$ . By retraining the modified model for few iterations (and imposing the structural constraints found through RePurpose), the accuracy of the fine-tuned model becomes $94.4\%$ .

Setup 2. Next, we consider a network of $P$ sensors where each sensor observes an image of a digit $x_{i}$ (from MNIST dataset) and the goal is finding the rounded average $\left[\frac{\sum_{i}x_{i}}{P}\right]$ . We adapted a Lenet-5 like structure LeCun et al. (1998) for the neural network which is trained in a central server (Fig. 7), and repeated the experiments several times. Note that one might attempt to classify the digits at each individual sensor and then share the value with other nodes to compute the average. However, in addition to the increased computational complexity at each individual node, it is worth mentioning that if the accuracy of digit recognition is $\rho$ , close to $1$ , then the final accuracy in computing the average would be approximately $\frac{1+8\rho^{P}}{9}$ . For example, for a network with $6$ nodes and $\rho=0.98$ , the final accuracy would be less than $90\%$ . We applied the RePurpose algorithm on the trained model for distributed inference over the sensor network with different communication (cross-worker edges) constraints. Fig. 6 compares the results of RePurpose with the baseline and direct sparsification, in a network with $P=6$ sensors.

Setup 3. Next, we consider $P$ sensors (cameras) that observe a scene and detect whether an specific object exists or not. For this purpose, we used a Resnet-like neural network He et al. (2016) over CIFAR10 and the objective is detecting the presence of a ”dog” in any of the images (Fig. 8). Fig. 6 shows the results of RePurpose, the baseline, and direct sparsification, in a network with $P=2$ sensors.

As seen from figures 4(a) and 5(a), RePurpose significantly outperforms sparsification and although its accuracy is dropped for large $\eta_{2}$ , with 1 or 10 epochs of post-training for MNIST and CIFAR10, respectively, (”FT RePurpose” in the figures) it achieves almost the same accuracy as the original model, while direct sparsification fails to provide good accuracy. Moreover, interestingly, RePurpose sparsifies the cross-edges between workers significantly for the hidden layers. The restructured model can achieve the same performance as the original model by using less than $0.0003$ of the cross-edges (i.e., between $10$ to $30$ connections out of more than $100000$ edges between workers). Finally, figures 4(c) and 5(c) compare the accuracy vs the cross-communication between workers. Clearly, direct sparsification performs well only when there are enough number of cross-edges between the workers, while the accuracy of the model obtained by RePurpose does not change for a vast sparsity range.

Finally, it is worth mentioning that in the naive approach to inference over the sensor network, each node has to transmit its observations to other nodes, hence the communication between any two pair of nodes would be $784$ or $1024$ values for Setups 2 and 3, respectively. However, RePurpose can achieve the same accuracy with less than 200 total communicated values across the entire network.

4.2 System Evaluations

Table 1: Target Accelerator Evaluation Platforms

Name	Node Compute	Node Memory	Network Bandwidth	Number of Nodes
Datacenter	125 TOPS	4GB	150 GB/s (NVLink)	1-32
Edge	0.5 TOPS	1GB	100 MB/s (Ethernet)	1-32

Methodology- We evaluate RePurpose on two distributed accelerator platforms, described in Tbl. LABEL:table:system_config, simulated using ASTRA-sim Rashidi et al. (2020). ASTRA-sim is an open-source distributed Deep Learning platform simulator that models cycle-level communication behavior in details for any partitioning strategy across multiple interconnected accelerator nodes. ASTRA-sim takes the compute cycles for each layer of the model as an external input, and manages communication scheduling similar to communication libraries like NVIDIA NCCL NVIDIA (2018). We obtained compute cycles for the Datacenter configuration from a NVIDIA V100 GPU implementation, and for the Edge configuration (e.g., sensor network) from a separate DNN accelerator simulator Samajdar et al. (2020).

We tried to stress the aforementioned platforms under various sized problems to show the efficiency of RePurpose. In all models, we assumed a stack of 5 layers with same number of neurons. In our notation, $N$ refers to the number of neurons per layer (or matrix dimensions). For the datacenter system, $N$ varies from $1K$ to $1M$ , while for edge system the variation is from $1K$ to $32K$ . We also assumed strict ordering between current communication and computation of next layer, meaning that each node begins computation of each layer only when it has all inputs available.

We picked 4 different flavors of RePurpose with 50%, 75%, 90% and 99% sparsity factor named as RP-50, RP-75, RP-90, and RP-99, respectively. In addition, we changed the number of worker nodes from 1 to 32 for both system configurations.

Results- Fig. 9 shows the total amount of data that each node needs to send out for one input sample for $N=8K$ . Clearly, specification has the linear effect on the amount of communicating data. On the other hand, partitioning across more nodes also increases the total communicating data. But the increase in rate diminishes as the number of nodes increases, converges to 2X more data compared to the case of 2 nodes.

To further investigate the effect of RePurpose in reducing the computation and communication times, Fig. 11 shows the simulation results of the communication and computation breakdown for the baseline system and RePurpose for $N=8k$ . As seen from Fig. 10(a), in a datacenter system, on average and across different number of nodes, RP-50, RP-75, RP-90 and RP-99 achieve 1.7 $\times$ , 2.76 $\times$ , 4.77 $\times$ and 10.47 $\times$ speed-up in computations, respectively. The average improvement for communication ratio is 1.2 $\times$ , 1.45 $\times$ , 1.74 $\times$ and 1.75 $\times$ , respectively. The reason for lower improvements of communication time is that due to NVLink’s high bandwidth. For $N=8K$ , network communication time is mostly network latency limited. Hence, reduction in input size does not correspond to linear reduction in communication time.

Fig. 10(b) shows the similar results but for edge system. Here, due to much lower network bandwidth, the effect of communication is more considerable. On average applying RP-50, RP-75, RP-90 and RP-99 improve computation times by 1.7 $\times$ , 2.77 $\times$ , 4.78 $\times$ and 11.01 $\times$ , respectively. This value for communication is 1.2 $\times$ , 1.38 $\times$ , 1.82 $\times$ and 3.04 $\times$ respectively. As the number of nodes grow, the communication gap between the baseline and RePurpose decreases. This is mostly because of the congestion in the network (e.g. switch) that decreases the effect of benefits gained by RePurpose.

Fig. 11 shows how communication, computation and total times change as the the number of neurons grows. For each network size, computation and communication times are averaged across different sparsity factors and node counts. For datacenter system (Fig10(c)), computation is the dominant factor. This is expected since the computation grows as $O(N^{2})$ while communication increases as $O(N)$ . Since the network band-width is very high in datacenter, the effect of communication is negligible. In general, the total time ratio increase from 1.01 $\times$ in $N=1K$ to 2.06 $\times$ in $N=1M$ . On the other hand, communication remains a considerable factor in the edge systems (Fig. 10(d)) due to: ( $i$ ) low network bandwidth, and ( $ii$ ) lower dimensions of workloads on edge systems. The total time improvement for edge system is $1.55\times$ for $N=1K$ and it increases to $3.8\times$ for $N=32K$ .

5 Conclusion

In this paper, we considered the problem of efficient parallel distributed inference of an already trained deep model over a cluster of processing units or a sensor network. Required communication and synchronization among processing units or network nodes (i.e., workers) can adversely affect the computation time. Moreover, in the wireless sensor networks, it may significantly increase the power consumption due to the transmission of large amount of data. We claimed that traditional approaches to prune or compress the deep models fail to consider the constraints imposed in such distributed inference systems. To overcome the shortcomings of the existing methods, we devised RePurpose, a framework to restructure the deep model by rearranging the neurons, optimum assignment of neurons to the workers, and then pruning the parameters, such that the dependency among workers is reduced. We showed that RePurpose can significantly reduce the number of cross-communication between workers and improve the computation time significantly, while the performance loss of the modified model is remained negligible.

References

Aghasi et al. [2017] Alireza Aghasi, Afshin Abdi, Nam Nguyen, and Justin Romberg. Net-Trim: Convex Pruning of Deep Neural Networks with Performance Guarantee. In I Guyon, U V Luxburg, S Bengio, H Wallach, R Fergus, S Vishwanathan, and R Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3180–3189. Curran Associates, Inc., 2017.
Aghasi et al. [2020] Alireza Aghasi, Afshin Abdi, and Justin Romberg. Fast convex pruning of deep neural networks. SIAM Journal on Mathematics of Data Science, 2(1):158–188, 2020.
Castellano et al. [1997] Giovanna Castellano, Anna Maria Fanelli, and Marcello Pelillo. An iterative pruning algorithm for feedforward neural networks. IEEE Transactions on Neural Networks, 8(3):519–531, may 1997. ISSN 1045-9227. doi: 10.1109/72.572092.
Cheng et al. [2015] Yu Cheng, X Yu Felix, Rogerio S Feris, Sanjiv Kumar, Alok Choudhary, and Shih-Fu Chang. Fast neural networks with circulant projections. arXiv preprint arXiv:1502.03436, 2015.
Chung et al. [2014] I-Hsin Hsin Chung, Tara N. Sainath, Bhuvana Ramabhadran, Michael Picheny, John Gunnels, Vernon Austel, Upendra Chauhari, and Brian Kingsbury. Parallel Deep Neural Network Training for Big Data on Blue Gene/Q. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, volume 28 of SC ’14, pages 745–753, Piscataway, NJ, USA, 2014. IEEE Press. ISBN 978-1-4799-5500-8. doi: 10.1109/TPDS.2016.2626289.
Cun et al. [1990] Yann Le Cun, John S Denker, Sara A Sola, T Bell Laboratories, and Sara A Solla. Optimal Brain Damage. In Advances in Neural Information Processing Systems 2, pages 598–605, San Francisco, CA, USA, 1990. Morgan Kaufmann Publishers Inc. ISBN 1-55860-100-7.
De Grazia et al. [2012] Michele De Filippo De Grazia, Ivilin Stoianov, and Marco Zorzi. Parallelization of deep networks. Proceedings of 2012 European Symposium on Artificial NN, Computational Intelligence and Machine Learning, pages 621–626, 2012.
Han et al. [2015] Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both Weights and Connections for Efficient Neural Networks. CoRR, abs/1506.02626:1–9, 2015. URL http://arxiv.org/abs/1506.02626.
Hassibi et al. [1993] Babak Hassibi, David G Stork, Sand Hill Road, and Menlo Park. Second Order Derivatives for Network Pruning: Optimal Brain Surgeon. In Advances in Neural Information Processing Systems 5, [NIPS Conference], pages 164–171, San Francisco, CA, USA, 1993. Morgan Kaufmann Publishers Inc. ISBN 1-55860-274-7. URL http://dl.acm.org/citation.cfm?id=645753.668069.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
Hinton et al. [2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network. ArXiv e-prints, pages 1–9, mar 2015. URL http://arxiv.org/abs/1503.02531.
Jonker and Volgenant [1987] Roy Jonker and Anton Volgenant. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38(4):325–340, 1987.
Kuhn [1955] H. W. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, mar 1955. doi: 10.1002/nav.3800020109.
LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Leung et al. [2001] Chi-Sing Sing Leung, Kwok-Wo Wo Wong, Pui-Fai Fai Sum, and Lai-Wan Wan Chan. A pruning method for the recursive least squared algorithm. Neural Networks, 14(2):147–174, 2001. ISSN 0893-6080. doi: http://dx.doi.org/10.1016/S0893-6080(00)00093-9.
Louizos et al. [2018] Christos Louizos, Max Welling, and Diederik P. Kingma. Learning Sparse Neural Networks through $\ell_{0}$ Regularization. In ICLR, pages 1–13, 2018.
Munkres [1957] James Munkres. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 5(1):32–38, mar 1957. doi: 10.1137/0105003.
NVIDIA [2018] NVIDIA. Nvidia collective communications library (nccl), 2018. URL https://developer.nvidia.com/nccl.
Rashidi et al. [2020] Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna. ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms. In IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2020, 2020.
Rigamonti et al. [2013] Roberto Rigamonti, Amos Sironi, Vincent Lepetit, and Pascal Fua. Learning separable filters. In 2013 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, jun 2013. doi: 10.1109/cvpr.2013.355.
Romero et al. [2015] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
Samajdar et al. [2020] Ananda Samajdar, Jan Moritz Joseph, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. A systematic methodology for characterizing scalability of dnn accelerators using scale-sim. In 2020 IEEE International Symposium on Performance Analysis of Systems and Software, 2020.
Tai et al. [2016] Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, et al. Convolutional neural networks with low-rank regularization. In ICLR, 2016.
Tomizawa [1971] N. Tomizawa. On some techniques useful for solution of transportation network problems. Networks, 1(2):173–194, 1971.
Wen et al. [2016] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016. ISBN 1878-3686 (Electronic). doi: 10.1016/j.ccr.2008.06.009.
Yang et al. [2015] Zichao Yang, Marcin Moczulski, Misha Denil, Nando De Freitas, Alex Smola, Le Song, Ziyu Wang, Nando de Freitas, Alex Smola, Le Song, and Ziyu Wang. Deep Fried Convnets. In The IEEE International Conference on Computer Vision (ICCV), pages 1476–1483, dec 2015. ISBN 9781467383912. doi: 10.1109/ICCV.2015.173.
Zagoruyko and Komodakis [2017] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.
Zhou et al. [2016] Hao Zhou, Jose M Alvarez, and Fatih Porikli. Less is more: Towards compact cnns. In European Conference on Computer Vision, pages 662–677. Springer, 2016.
Zinkevich et al. [2010] Martin A Zinkevich, Alex J Smola, Markus Weimer, Lihong Li, and Alex J Smola. Parallelized Stochastic Gradient Descent. In J D Lafferty, C K I Williams, J Shawe-Taylor, R S Zemel, and A Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2595–2603. Curran Associates, Inc., 2010.

Appendix A Complexity of Naive Direct Partitioning

Consider distributing processing of a layer of a deep neural network with $N$ neurons over $P$ workers. Without assuming any constraint on the number of neurons per worker, there are $P$ possible assignments for each neuron, hence, the total possible neuron assignments to the workers would be $P^{N}$ .

Now, assume that exactly $n_{k}$ neurons have to be assigned to the $k$ -th worker, where $\sum_{k}n_{k}=N$ . Clearly, there are

{N\choose{n_{1},n_{2},\ldots,n_{P}}}

possible neuron assignment to the workers. To have a relatively balanced neuron assignment (i.e., no worker or a small subset of workers has to process almost all signals), we assume that $n_{k}=c_{k}\,N$ , where $c_{k}=\Theta(1/P)$ , i.e., there exists $\alpha,\beta>0$ such that $\alpha N/P\leq n_{k}\leq\beta N/P$ . Using Stirling’s approximation for factorial, $n_{k}!\sim\sqrt{2\pi n_{k}}\,(\frac{n_{k}}{e})^{n_{k}}$ , and noting that $n_{k}=N\Theta(\frac{1}{P})$ , $\sum_{k}n_{k}=N$ , we have

	$\displaystyle{N\choose{n_{1},n_{2},\ldots,n_{P}}}$	$\displaystyle\sim\frac{\sqrt{2\pi N}\,(\frac{N}{e})^{N}}{\prod_{k=1}^{P}\sqrt{2\pi n_{k}}\,(\frac{n_{k}}{e})^{n_{k}}}$
		$\displaystyle=\frac{\sqrt{2\pi N}\,(\frac{N}{e})^{N}}{\prod_{k=1}^{P}\sqrt{2\pi N\Theta(\frac{1}{P})}\,\big{(}\frac{N\Theta(\frac{1}{P})}{e}\big{)}^{n_{k}}}$
		$\displaystyle=\frac{1}{(2\pi N)^{\frac{P}{2}-1}}\,\frac{1}{\Theta(\frac{1}{P^{N+0.5}})}$
		$\displaystyle=\Theta\big{(}P^{N+0.5}\,N^{1-\frac{P}{2}}\big{)}.$

Therefore, the direct approach to find good neuron assignment for parallel distributed inference requires evaluation of $\mathcal{O}(P^{N})$ different assignments, which for large number of neurons or number of workers becomes prohibitive.

Appendix B Application of RePurpose in Deep Neural Networks

Recall that at the core of the RePurpose algorithm is solving the optimization problem and finding the cost of assigning neuron $i$ to worker $j$ , given by

C_{ji}=\min_{\widehat{\bm{w}}_{i}}\|\bm{w}_{i}-\widehat{\bm{w}}_{i}\|_{2}^{2}+\eta_{1}\|\widehat{\bm{w}}_{i}\|_{0}+\eta_{2}\|\widehat{\bm{w}}_{i,\backslash j}\|_{0},

(7)

where $\eta_{1}$ and $\eta_{2}$ control the trade-off between the error, sparsity, and cross-communication.

The basic RePurpose function and its application to a deep neural network with weights and biases $\{\bm{W}^{(l)},\bm{b}^{(l)}\}$ are summarized in Algorithms 1 and 2, respectively. In Alg. 2, $n_{k}^{(l)}$ is the number of neurons in layer $l$ being assigned to worker $k$ , and $\mathcal{H}_{\bm{E}}(\cdot)$ is the (modified) element-wise hard-thresholding operator, defined as

\bm{Y}=\mathcal{H}_{\bm{E}}(\bm{X}):\quad Y_{ij}=\left\{\begin{array}[]{ll}0&\textup{if }|X_{ij}|^{2}\leq E_{ij}\\ X_{ij}&\textup{o.w.}\end{array}\right.

(8)

Input: $\{\bm{W}^{(l)}\}_{{}_{l}}$ , $\{\bm{b}^{(l)}\}_{{}_{l}}$ , $\{n_{k}^{(l)}\}_{{}_{k,l}}$ , $\eta_{1}$ , $\eta_{2}$ Output: $\{\bm{\Pi}^{(l)}\}_{l}$ , $\{\widehat{\bm{W}}^{(l)}\}_{l}$ , $\{\widehat{\bm{b}}^{(l)}\}_{l}$ $\bm{E}=\eta_{1}+\eta_{2}\bm{M}$ $\bm{\Pi}^{(0)}\leftarrow\bm{I}$ for layers $l=1,\ldots,L$ do $\bm{T}\leftarrow\bm{\Pi}^{(l-1)}\bm{W}^{(l)}$ $\bm{\Pi}^{(l)}\leftarrow\textsc{RePurpose}(\bm{T},\{n_{k}^{(l)}\}_{{}_{k}},\eta_{1},\eta_{2})$ $\widehat{\bm{W}}^{(l)}\leftarrow\mathcal{H}_{\bm{E}}\big{(}\bm{T}(\bm{\Pi}^{(l)})^{\mathsf{T}}\big{)}$ $\widehat{\bm{b}}^{(l)}\leftarrow\bm{\Pi}^{(l)}\bm{b}^{(l)}$ Algorithm 2 Applying RePurpose to Deep Neural Networks

Recall that when applying RePurpose to layers of a neural network, permuting neurons of layer $l$ with matrix $\bm{\Pi}$ changes the signal of that layer by $\widehat{\bm{y}}^{(l)}=\bm{\Pi y}^{(l)}$ and affects the weight matrix of that layer by $\bm{W}^{(l)}\bm{\Pi}^{\mathsf{T}}$ . As a result, $\widehat{\bm{x}}^{(l+1)}=\bm{\Pi x}^{(l+1)}$ and to have the same signal at the next layer, $\bm{y}^{(l+1)}$ , the weight matrix of layer $l+1$ should be modified as $\bm{\Pi W}^{(l+1)}$ . Line 2 of Alg. 2 accounts for these adjustments.

Appendix C Performance Guarantee of RePurpose

\bm{y}^{(l)}=(\bm{W}^{(l)})^{\mathsf{T}}\bm{x}^{(l)}+\bm{b}^{(l)},\quad\quad\bm{x}^{(l+1)}=\sigma(\bm{y}^{(l)}),

(9)

where $\sigma(\cdot)$ is the activation function.

To analyze the performance of the modified neural network, assume that the original neural network has the following properties:

A1.

The activation functions are $1$ -Lipschitz, i.e., for all $u,v$ , $|\sigma(u)-\sigma(v)|\leq|u-v|$ .
A2.

The Frobenius norms of the weights of the neural network are bounded, i.e., for some constant $\tau>0$ , $\|\bm{W}^{(l)}\|_{F}\leq\tau$ , for all layers $l=1,\ldots,L$ .
A3.

The signals in the neural networks are bounded, i.e., there exists a constant $B>0$ such that for input signal $\bm{x}^{(1)}=\bm{x}_{in}$ , and forward signals $\{\bm{x}^{(l)}\}_{l=2}^{L}$ (outputs of the hidden layers), $\|\bm{x}^{(l)}\|_{2}\leq B$ for $l=1,\ldots,L$ .

Moreover, suppose that the parameters $\eta_{1}$ and $\eta_{2}$ at each call of the RePurpose are adjusted such that the solution of Lagrangian formulation (6), given by RePurpose, is also the solution of the following constrained optimization problem

\min_{\widehat{\bm{W}},\bm{\Pi}}\|\bm{M}\odot\widehat{\bm{W}}\|_{0}\quad\quad\operatorname*{s.t.}~{}\|\widehat{\bm{W}}-\bm{T\Pi}^{\mathsf{T}}\|_{F}^{2}\leq\epsilon.

(10)

Hence, by Alg. 2 and the cascade application of RePurpose, the modified weight matrix of the $l$ -th layer of neural network satisfies $\|\widehat{\bm{W}}^{(l)}-\bm{\Pi}^{(l-1)}\bm{W}^{(l)}(\bm{\Pi}^{(l)})^{\mathsf{T}}\|_{F}^{2}\leq\epsilon$ . For the simplicity in notations, let $\varepsilon=\sqrt{\epsilon}$ .

Theorem 3.

For an input data $\bm{x}$ , let $\bm{y}$ and $\widehat{\bm{y}}$ be the outputs of the original and RePurposed neural network, respectively. If $\bm{\Pi}$ is the permutation of the final output neurons in the RePurposed neural network, then under assumptions A1-3,

\|\widehat{\bm{y}}-\bm{\Pi y}\|_{2}\leq\varepsilon\frac{(\tau+\varepsilon)^{L}-1}{\tau+\varepsilon-1}B.

(11)

Especially, if the parameters of the neural network are normalized such that $\|\bm{W}^{(l)}\|_{F}=1$ , then $\|\widehat{\bm{y}}-\bm{\Pi y}\|_{2}\leq\big{(}(1+\varepsilon)^{L}-1\big{)}B.$

Proof.

Let $\bm{x}^{(l)}$ and $\widehat{\bm{x}}^{(l)}$ be the signals in the original and modified neural network, corresponding to the input $\bm{x}$ . Note that $\bm{\Pi}^{(0)}=\bm{I}$ and the input to both networks are the same, $\bm{x}^{(1)}=\widehat{\bm{x}}^{(1)}=\bm{x}$ . Let $\bm{\Pi}^{(l)}$ and $\{\widehat{\bm{W}}^{(l)},\widehat{\bm{b}}^{(l)}\}$ be the permutation matrix and parameters of the modified neural network, found via (3). Therefore, using $\bm{x}^{(l+1)}=\sigma((\bm{W}^{(l)})^{\mathsf{T}}\bm{x}^{(l)}+\bm{b}^{(l)})$ , for any arbitrary layer $l$ ,

		$\displaystyle\\|\bm{\Pi}^{(l)}\bm{x}^{(l+1)}-\widehat{\bm{x}}^{(l+1)}\\|_{2}$
	$\displaystyle=~{}$	$\displaystyle\\|\bm{\Pi}^{(l)}\sigma((\bm{W}^{(l)})^{\mathsf{T}}\bm{x}^{(l)}+\bm{b}^{(l)})-\sigma((\widehat{\bm{W}}^{(l)})^{\mathsf{T}}\widehat{\bm{x}}^{(l)}+\widehat{\bm{b}}^{(l)})\\|_{2}$
	$\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}~{}$	$\displaystyle\\|\sigma(\bm{\Pi}^{(l)}(\bm{W}^{(l)})^{\mathsf{T}}\bm{x}^{(l)}+\bm{\Pi}^{(l)}\bm{b}^{(l)})-\sigma((\widehat{\bm{W}}^{(l)})^{\mathsf{T}}\widehat{\bm{x}}^{(l)}+\widehat{\bm{b}}^{(l)})\\|_{2}$
	$\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}~{}$	$\displaystyle\\|(\bm{\Pi}^{(l)}(\bm{W}^{(l)})^{\mathsf{T}}\bm{x}^{(l)}+\bm{\Pi}^{(l)}\bm{b}^{(l)})-((\widehat{\bm{W}}^{(l)})^{\mathsf{T}}\widehat{\bm{x}}^{(l)}+\widehat{\bm{b}}^{(l)})\\|_{2}$
	$\displaystyle\stackrel{{\scriptstyle(c)}}{{=}}~{}$	$\displaystyle\\|\bm{\Pi}^{(l)}(\bm{W}^{(l)})^{\mathsf{T}}\bm{x}^{(l)}-(\widehat{\bm{W}}^{(l)})^{\mathsf{T}}\widehat{\bm{x}}^{(l)}\\|_{2}$
	$\displaystyle=~{}$	$\displaystyle\\|\big{(}\bm{\Pi}^{(l-1)}\bm{W}^{(l)}(\bm{\Pi}^{(l)})^{\mathsf{T}}-\widehat{\bm{W}}^{(l)}\big{)}^{\mathsf{T}}\widehat{\bm{x}}^{(l)}+\big{(}\bm{W}^{(l)}(\bm{\Pi}^{(l)})^{\mathsf{T}}\big{)}^{\mathsf{T}}\big{(}(\bm{\Pi}^{(l-1)})^{\mathsf{T}}\widehat{\bm{x}}^{(l)}-\bm{x}^{(l)}\big{)}\\|_{2}$
	$\displaystyle\leq~{}$	$\displaystyle\\|\big{(}\bm{\Pi}^{(l-1)}\bm{W}^{(l)}(\bm{\Pi}^{(l)})^{\mathsf{T}}-\widehat{\bm{W}}^{(l)}\big{)}^{\mathsf{T}}\widehat{\bm{x}}^{(l)}\\|_{2}+\\|\big{(}\bm{W}^{(l)}(\bm{\Pi}^{(l)})^{\mathsf{T}}\big{)}^{\mathsf{T}}\big{(}(\bm{\Pi}^{(l-1)})^{\mathsf{T}}\widehat{\bm{x}}^{(l)}-\bm{x}^{(l)}\big{)}\\|_{2}$
	$\displaystyle\stackrel{{\scriptstyle(d)}}{{\leq}}~{}$	$\displaystyle\\|\bm{\Pi}^{(l-1)}\bm{W}^{(l)}(\bm{\Pi}^{(l)})^{\mathsf{T}}-\widehat{\bm{W}}^{(l)}\\|_{F}\,\\|\widehat{\bm{x}}^{(l)}\\|_{2}+\\|\bm{W}^{(l)}(\bm{\Pi}^{(l)})^{\mathsf{T}}\\|_{F}\,\\|(\bm{\Pi}^{(l-1)})^{\mathsf{T}}\widehat{\bm{x}}^{(l)}-\bm{x}^{(l)}\\|_{2}$
	$\displaystyle=~{}$	$\displaystyle\\|\widehat{\bm{W}}^{(l)}-\bm{\Pi}^{(l-1)}\bm{W}^{(l)}(\bm{\Pi}^{(l)})^{\mathsf{T}}\\|_{F}\,\\|\widehat{\bm{x}}^{(l)}\\|_{2}+\\|\bm{W}^{(l)}\\|_{F}\,\\|\widehat{\bm{x}}^{(l)}-\bm{\Pi}^{(l-1)}\bm{x}^{(l)}\\|_{2}$
	$\displaystyle\stackrel{{\scriptstyle(e)}}{{\leq}}~{}$	$\displaystyle\varepsilon\\|\widehat{\bm{x}}^{(l)}\\|_{2}+\tau\\|\bm{\Pi}^{(l-1)}\bm{x}^{(l)}-\widehat{\bm{x}}^{(l)}\\|_{2}$
	$\displaystyle\leq~{}$	$\displaystyle\varepsilon\big{(}\\|\widehat{\bm{x}}^{(l)}-\bm{\Pi}^{(l-1)}\bm{x}^{(l)}\\|_{2}+\\|\bm{\Pi}^{(l-1)}\bm{x}^{(l)}\\|_{2}\big{)}+\tau\\|\bm{\Pi}^{(l-1)}\bm{x}^{(l)}-\widehat{\bm{x}}^{(l)}\\|_{2}$
	$\displaystyle=~{}$	$\displaystyle(\tau+\varepsilon)\\|\bm{\Pi}^{(l-1)}\bm{x}^{(l)}-\widehat{\bm{x}}^{(l)}\\|_{2}+\varepsilon\\|\bm{x}^{(l)}\\|_{2}$
	$\displaystyle\leq~{}$	$\displaystyle(\tau+\varepsilon)\\|\bm{\Pi}^{(l-1)}\bm{x}^{(l)}-\widehat{\bm{x}}^{(l)}\\|_{2}+\varepsilon B$

where $(a)$ is because $\bm{\Pi}\sigma(\bm{z})=\sigma(\bm{\Pi z})$ for arbitrary permutation $\bm{\Pi}$ and vector $\bm{z}$ , $(b)$ is because $\sigma(\cdot)$ is $1$ -Lipschitz, $(c)$ is due to the fact that $\widehat{\bm{b}}^{(l)}=\bm{\Pi}^{(l)}\bm{b}^{(l)}$ , $(d)$ is from $\|\bm{Az}\|_{2}\leq\|\bm{A}\|_{2}\|\bm{z}\|_{2}\leq\|\bm{A}\|_{F}\|\bm{z}\|_{2}$ for arbitrary $\bm{A}$ and $\bm{z}$ , and $(e)$ is by assumption A2 and (3). Therefore,

\|\bm{\Pi}^{(l)}\bm{x}^{(l+1)}-\widehat{\bm{x}}^{(l+1)}\|_{2}\leq(\tau+\epsilon)\|\bm{\Pi}^{(l-1)}\bm{x}^{(l)}-\widehat{\bm{x}}^{(l)}\|_{2}+\varepsilon B.

(12)

Since $\bm{x}=\bm{\Pi}^{(0)}\bm{x}^{(1)}=\widehat{\bm{x}}^{(1)}$ , (12) implies that

\|\bm{\Pi}^{(l)}\bm{x}^{(l+1)}-\widehat{\bm{x}}^{(l+1)}\|_{2}\leq\big{(}\sum_{k=1}^{l}(\tau+\epsilon)^{l-k}\big{)}\varepsilon B=\frac{(\tau+\varepsilon)^{l}-1}{\tau+\varepsilon-1}\varepsilon B.

(13)

Specifically, for the output signals, $\bm{y}=\bm{x}^{L+1}$ and $\widehat{\bm{y}}=\widehat{\bm{x}}^{(L+1)}$ , it implies that

\|\widehat{\bm{y}}-\bm{\Pi y}\|_{2}\leq\varepsilon\frac{(\tau+\varepsilon)^{L}-1}{\tau+\varepsilon-1}B.

∎

Therefore, if the hyperparameters of RePurpose are chosen carefully, we can ensure that the output of the modified neural network is close to the original model (after accounting for the possible rearrangement of the neurons of the output layer).

Appendix D Proofs of the Main Results

D.1 Proof of Lemma 1

Lemma.

The solution of

\min_{\bm{x}}\|\bm{y}-\bm{x}\|_{2}^{2}+\eta_{1}\|\bm{x}\|_{0}+\eta_{2}\|\bm{x}_{\backslash j}\|_{0},

(14)

is given by element-wise hard-thresholding $\bm{y}$ , i.e.,

x_{n}=\left\{\begin{array}[]{ll}0&\textup{if }|y_{n}|\leq\sqrt{\eta}\\ y_{n}&\textup{o.w.}\end{array}\right.

(15)

where $\eta=\eta_{1}$ or $\eta_{1}+\eta_{2}$ , depending on whether neuron $n$ is in $\bm{y}_{\backslash j}$ or not.

Proof.

Let $\Omega$ be the indexes in the $j$ -th block. Therefore, $\bm{x}_{\backslash j}$ consists of elements of $\bm{x}$ that are not in the set $\Omega$ , and

	$\displaystyle\\|\bm{y}-\bm{x}\\|_{2}^{2}+\eta_{1}\\|\bm{x}\\|_{0}+\eta_{2}\\|\bm{x}_{\backslash j}\\|_{0}$	$\displaystyle=\sum_{n\in\Omega}(y_{n}-x_{n})^{2}+\eta_{1}\mathds{I}\left(x_{n}\neq 0\right)$
		$\displaystyle+\sum_{n\notin\Omega}(y_{n}-x_{n})^{2}+(\eta_{1}+\eta_{2})\mathds{I}\left(x_{n}\neq 0\right),$

where $\mathds{I}\left(z\right)=1$ if $z$ is true and is $0$ otherwise. Therefore, the minimization in (14) can be cast as separate minimizations over scalars $x_{n}$ . For example, if $n\in\Omega$ , there are two possibilities for $x_{n}$ ,

\left\{\begin{array}[]{ll}x_{n}=0&\Rightarrow cost=y_{n}^{2}\\ x_{n}\neq 0&\Rightarrow cost=\min_{x_{n}\neq 0}(y_{n}-x_{n})^{2}+\eta_{1}=\eta_{1}\end{array}\right.

Hence, the solution would be

n\in\Omega:\quad x_{n}^{*}=\left\{\begin{array}[]{ll}0&\textup{if }|y_{n}|\leq\sqrt{\eta_{1}}\\ y_{n}&\textup{o.w.}\end{array}\right.

Similarly,

n\notin\Omega:\quad x_{n}^{*}=\left\{\begin{array}[]{ll}0&\textup{if }|y_{n}|\leq\sqrt{\eta_{1}+\eta_{2}}\\ y_{n}&\textup{o.w.}\end{array}\right.

∎

D.2 Proof of Theorem 2

Theorem.

Algorithm 1 finds the optimum solution of

\min_{\widehat{\bm{W}},\bm{\Pi}}\|\widehat{\bm{W}}-\bm{W\Pi}^{\mathsf{T}}\|_{F}^{2}+\eta_{1}\|\widehat{\bm{W}}\|_{0}+\eta_{2}\|\bm{M}\odot\widehat{\bm{W}}\|_{0},

(16)

with time complexity $\mathcal{O}(N^{3})$ , where $N$ is the number of layer’s neurons (number of columns of $\bm{W}$ ).

Proof.

First, we note that for any permutation matrix $\bm{\Pi}$ , $\|\widehat{\bm{W}}-\bm{W\Pi}^{\mathsf{T}}\|_{F}^{2}=\|\widehat{\bm{W}}\bm{\Pi}-\bm{W}\|_{F}^{2}$ , $\|\widehat{\bm{W}}\|_{0}=\|\widehat{\bm{W}}\bm{\Pi}\|_{0}$ , and $\|\bm{M}\odot\widehat{\bm{W}}\|_{0}=\|(\bm{M\Pi})\odot(\widehat{\bm{W}}\bm{\Pi})\|_{0}$ . Therefore, by defining $\bm{X}=\widehat{\bm{W}}\bm{\Pi}$ , the optimization (6) can be rewritten as

		$\displaystyle\min_{\bm{\Pi}}\min_{\bm{X}}\\|\bm{X}-\bm{W}\\|_{F}^{2}+\eta_{1}\\|\bm{X}\\|_{0}+\eta_{2}\\|(\bm{M\Pi})\odot\bm{X}\\|_{0}$
	$\displaystyle=$	$\displaystyle\min_{\bm{\Pi}}\min_{\bm{X}}\sum_{{i,k:\,\Pi_{k,i}=1}}\\|\bm{x}_{i}-\bm{w}_{i}\\|_{2}^{2}+\eta_{1}\\|\bm{x}_{i}\\|_{0}+\eta_{2}\\|\bm{m}_{k}\odot\bm{x}_{i}\\|_{0},$
	$\displaystyle=$	$\displaystyle\min_{\bm{\Pi}}\sum_{{i,k:\,\Pi_{k,i}=1}}\min_{\bm{x}_{i}}\\|\bm{x}_{i}-\bm{w}_{i}\\|_{2}^{2}+\eta_{1}\\|\bm{x}_{i}\\|_{0}+\eta_{2}\\|\bm{m}_{k}\odot\bm{x}_{i}\\|_{0}.$

On the other hand, recall that $\bm{M}=1-\operatorname{diag}\left(\bm{1}_{\iota_{{}_{1}}\times n_{{}_{1}}},\ldots,\bm{1}_{\iota_{{}_{P}}\times n_{{}_{P}}}\right)$ , and hence if $\bm{m}_{k}$ is from the $j$ -th sub-block, i.e., it corresponds to the $j$ -th worker, the inner minimization would be

C_{ji}=\min_{\bm{x}_{i}}\|\bm{x}_{i}-\bm{w}_{i}\|_{2}^{2}+\eta_{1}\|\bm{x}_{i}\|_{0}+\eta_{2}\|\bm{x}_{i,\backslash j}\|_{0}.

(17)

By repeating the $k$ -th row of matrix $\bm{C}$ whose elements are defined as (17) to construct the new $N\times N$ matrix $\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.59721pt}{\hskip 1.29167pt$\widetildesym$}}}}{\bm{C}}}{\accentset{\textstyle\text{\smash{\raisebox{-5.59721pt}{\hskip 1.29167pt$\widetildesym$}}}}{\bm{C}}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.91806pt}{\hskip 0.90417pt$\widetildesym$}}}}{\bm{C}}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.7986pt}{\hskip 0.64583pt$\widetildesym$}}}}{\bm{C}}}$ , we will have $C_{ji}=\tilde{C}_{ki}$ . Therefore,

\displaystyle\min_{\widehat{\bm{W}},\bm{\Pi}}\|\widehat{\bm{W}}-\bm{W\Pi}^{\mathsf{T}}\|_{F}^{2}+\eta_{1}\|\widehat{\bm{W}}\|_{0}+\eta_{2}\|\bm{M}\odot\widehat{\bm{W}}\|_{0}=\min_{\bm{\Pi}}\sum_{{(i,k):\,\Pi_{k,i}=1}}\tilde{C}_{ki}.

As a result, selecting the best neuron assignment boils down to choosing $N$ elements from $\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.59721pt}{\hskip 1.29167pt$\widetildesym$}}}}{\bm{C}}}{\accentset{\textstyle\text{\smash{\raisebox{-5.59721pt}{\hskip 1.29167pt$\widetildesym$}}}}{\bm{C}}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.91806pt}{\hskip 0.90417pt$\widetildesym$}}}}{\bm{C}}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.7986pt}{\hskip 0.64583pt$\widetildesym$}}}}{\bm{C}}}$ such that from each row or column only one element is selected and the sum of the selected values is minimum. This problem can be solved efficiently in polynomial time using the Hungarian algorithm. Tomizawa [1971], Jonker and Volgenant [1987] solve the assignment algorithm with $\mathcal{O}(N^{3})$ time complexity. Since the complexity of creating $\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.59721pt}{\hskip 1.29167pt$\widetildesym$}}}}{\bm{C}}}{\accentset{\textstyle\text{\smash{\raisebox{-5.59721pt}{\hskip 1.29167pt$\widetildesym$}}}}{\bm{C}}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.91806pt}{\hskip 0.90417pt$\widetildesym$}}}}{\bm{C}}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.7986pt}{\hskip 0.64583pt$\widetildesym$}}}}{\bm{C}}}$ is at most $\mathcal{O}(N^{2})$ , the total complexity of Algorithm 1 would be $\mathcal{O}(N^{3})$ . ∎

Appendix E Reduction in Computational Complexity

One major benefit of applying RePurpose, as demonstrated in simulations, is the reduction in the computational complexity. For the sake of simplicity, assume that there are $P=2$ workers. Recall that the computations at worker $1$ is given as $\bm{y}_{1}=\bm{W}_{11}^{\mathsf{T}}\bm{x}_{1}+\bm{b}_{1}+\bm{W}_{12}^{\mathsf{T}}\bm{x}_{2}$ . By the application of RePurpose to the weight matrix $\bm{W}$ , the off-diagonal blocks, $\bm{W}_{12}$ and $\bm{W}_{21}$ , become sparse. Let $\Omega$ be the indexes of the columns of $\bm{W}_{12}$ which are non-zero, and define $\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.59721pt}{\hskip 1.29167pt$\widetildesym$}}}}{\bm{W}}}{\accentset{\textstyle\text{\smash{\raisebox{-5.59721pt}{\hskip 1.29167pt$\widetildesym$}}}}{\bm{W}}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.91806pt}{\hskip 0.90417pt$\widetildesym$}}}}{\bm{W}}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.7986pt}{\hskip 0.64583pt$\widetildesym$}}}}{\bm{W}}}_{12}$ to be the restriction of $\bm{W}_{12}$ to those non-zero columns. Similarly, define $\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.59721pt}{\hskip 1.29167pt$\widetildesym$}}}}{\bm{x}}}{\accentset{\textstyle\text{\smash{\raisebox{-5.59721pt}{\hskip 1.29167pt$\widetildesym$}}}}{\bm{x}}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.91806pt}{\hskip 0.90417pt$\widetildesym$}}}}{\bm{x}}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.7986pt}{\hskip 0.64583pt$\widetildesym$}}}}{\bm{x}}}_{2}$ to be the restriction of $\bm{x}_{2}$ to the indexes given by $\Omega$ . Therefore, $\bm{y}_{1}$ can be more efficiently calculated as $\bm{y}_{1}=\bm{W}_{11}^{\mathsf{T}}\bm{x}_{1}+\bm{b}_{1}+\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.59721pt}{\hskip 1.29167pt$\widetildesym$}}}}{\bm{W}}}{\accentset{\textstyle\text{\smash{\raisebox{-5.59721pt}{\hskip 1.29167pt$\widetildesym$}}}}{\bm{W}}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.91806pt}{\hskip 0.90417pt$\widetildesym$}}}}{\bm{W}}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.7986pt}{\hskip 0.64583pt$\widetildesym$}}}}{\bm{W}}}_{12}^{\mathsf{T}}\mathchoice{\accentset{\displaystyle\text{\smash{\raisebox{-5.59721pt}{\hskip 1.29167pt$\widetildesym$}}}}{\bm{x}}}{\accentset{\textstyle\text{\smash{\raisebox{-5.59721pt}{\hskip 1.29167pt$\widetildesym$}}}}{\bm{x}}}{\accentset{\scriptstyle\text{\smash{\raisebox{-3.91806pt}{\hskip 0.90417pt$\widetildesym$}}}}{\bm{x}}}{\accentset{\scriptscriptstyle\text{\smash{\raisebox{-2.7986pt}{\hskip 0.64583pt$\widetildesym$}}}}{\bm{x}}}_{2}$ . If $\bm{W}_{12}$ is an $m\times n$ matrix, the computational complexity and the communication requirement of the cross-term $\bm{W}_{12}^{\mathsf{T}}\bm{x}_{2}$ in the original calculation would be $\mathcal{O}(mn)$ and $\mathcal{O}(m)$ , respectively. RePurpose reduces these complexities to $\mathcal{O}(|\Omega|n)$ and $\mathcal{O}(|\Omega|)$ . As shown in simulations, the set $\Omega$ can be extremely small, making the computational complexity of the cross-term negligible. For example, in applying the proposed technique to an $N\times N$ matrix to distributed its computations over $2$ workers, if the number of cross dependencies are reduced by a factor of $10$ , then the computational complexity of matrix multiplication would be reduced to $0.275\,N^{2}$ per worker, almost $1.8$ times reduction from $N^{2}/2$ in naive parallel implementation.

Appendix F Extension of RePurpose to Convolutional Layers

Consider a convolutional layer whose input consists of $c_{in}$ channels of $d$ -dimensional tensors and its output has $c_{out}$ channels. Let $h(z_{0},\ldots,z_{d-1},c_{in},c_{out})$ be the kernel. For the sake of simplicity in notations, we ignore strides and dilation in convolution operator. Hence, the output would be

O(x_{0},\ldots,x_{d-1},k)=\sum_{l=1}^{c_{in}}\sum_{z_{0},\ldots,z_{d-1}}h(z_{0},\ldots,z_{d-1},l,k)\,I(x_{0}+z_{0},\ldots,x_{d-1}+z_{d-1},l),

where $I(\cdot)$ is the input $d$ -dimensional tensor with $c_{in}$ channels and $O(\cdot)$ is the output tensor.

Note that due to the nature of the convolution operator, it is not possible to rearrange the neurons within each channel (e.g., changing locations of pixels in images). However, we propose to change the order of the channels. Note that the convolution can be rewritten as

O_{k}(x_{0},\ldots,x_{d-1})=\sum_{l=1}^{c_{in}}h_{l,k}*I_{l}\,(x_{0},\ldots,x_{d-1}),

where $h_{l,k}(\cdots)=h(\cdots,l,k)$ is the kernel connecting input channel $l$ to output channel $k$ , $I_{l}(\cdot)$ is the $l$ -th channel of the input tensor, and $O_{k}(\cdot)$ is the $k$ -th output channel. Now, similar to (4), we can define the cost of assigning channel $i$ to the $j$ -th worker as follows:

C_{ji}=\min_{\{\widehat{\bm{h}}_{l,i}\}}\sum_{l=1}^{c_{in}}\|\bm{h}_{l,i}-\widehat{\bm{h}}_{l,i}\|_{F}^{2}+\eta_{1}\sum_{l=1}^{c_{in}}\mathds{I}\left(\widehat{\bm{h}}_{l,i}\neq\bm{0}\right)+\eta_{2}\sum_{l:l\notin\mathcal{C}_{in}(j)}\mathds{I}\left(\widehat{\bm{h}}_{l,i}\neq\bm{0}\right),

(18)

where $\mathcal{C}_{in}(j)$ is the set of input channels located at the $j$ -th worker, and $\mathds{I}\left(z\right)=1$ if $z$ is true, and is $0$ , otherwise. Note that for the convolutional layers, we treat the individual filters as a whole, and the entire channel filter may be set to zero, not the individual coefficients. The solution of (18) is given by hard-thresholding,

\widehat{\bm{h}}_{l,i}=\left\{\begin{array}[]{ll}\bm{0}&\|\bm{h}_{l,i}\|_{F}^{2}\leq\eta\\ \bm{h}_{l,i}&\textup{o.w.}\end{array}\right.

(19)

where $\eta=\eta_{1}$ if $l\in\mathcal{C}_{in}(j)$ and $\eta=\eta_{1}+\eta_{2}$ , otherwise.

With the new assignment cost, RePurpose for convolutional layers is simply given as in Alg. 1.

		$\displaystyle\\|\bm{\Pi}^{(l)}\bm{x}^{(l+1)}-\widehat{\bm{x}}^{(l+1)}\\|_{2}$
	$\displaystyle=~{}$	$\displaystyle\\|\bm{\Pi}^{(l)}\sigma((\bm{W}^{(l)})^{\mathsf{T}}\bm{x}^{(l)}+\bm{b}^{(l)})-\sigma((\widehat{\bm{W}}^{(l)})^{\mathsf{T}}\widehat{\bm{x}}^{(l)}+\widehat{\bm{b}}^{(l)})\\|_{2}$
	$\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}~{}$	$\displaystyle\\|\sigma(\bm{\Pi}^{(l)}(\bm{W}^{(l)})^{\mathsf{T}}\bm{x}^{(l)}+\bm{\Pi}^{(l)}\bm{b}^{(l)})-\sigma((\widehat{\bm{W}}^{(l)})^{\mathsf{T}}\widehat{\bm{x}}^{(l)}+\widehat{\bm{b}}^{(l)})\\|_{2}$
	$\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}~{}$	$\displaystyle\\|(\bm{\Pi}^{(l)}(\bm{W}^{(l)})^{\mathsf{T}}\bm{x}^{(l)}+\bm{\Pi}^{(l)}\bm{b}^{(l)})-((\widehat{\bm{W}}^{(l)})^{\mathsf{T}}\widehat{\bm{x}}^{(l)}+\widehat{\bm{b}}^{(l)})\\|_{2}$
	$\displaystyle\stackrel{{\scriptstyle(c)}}{{=}}~{}$	$\displaystyle\\|\bm{\Pi}^{(l)}(\bm{W}^{(l)})^{\mathsf{T}}\bm{x}^{(l)}-(\widehat{\bm{W}}^{(l)})^{\mathsf{T}}\widehat{\bm{x}}^{(l)}\\|_{2}$
	$\displaystyle=~{}$	$\displaystyle\\|\big{(}\bm{\Pi}^{(l-1)}\bm{W}^{(l)}(\bm{\Pi}^{(l)})^{\mathsf{T}}-\widehat{\bm{W}}^{(l)}\big{)}^{\mathsf{T}}\widehat{\bm{x}}^{(l)}+\big{(}\bm{W}^{(l)}(\bm{\Pi}^{(l)})^{\mathsf{T}}\big{)}^{\mathsf{T}}\big{(}(\bm{\Pi}^{(l-1)})^{\mathsf{T}}\widehat{\bm{x}}^{(l)}-\bm{x}^{(l)}\big{)}\\|_{2}$
	$\displaystyle\leq~{}$	$\displaystyle\\|\big{(}\bm{\Pi}^{(l-1)}\bm{W}^{(l)}(\bm{\Pi}^{(l)})^{\mathsf{T}}-\widehat{\bm{W}}^{(l)}\big{)}^{\mathsf{T}}\widehat{\bm{x}}^{(l)}\\|_{2}+\\|\big{(}\bm{W}^{(l)}(\bm{\Pi}^{(l)})^{\mathsf{T}}\big{)}^{\mathsf{T}}\big{(}(\bm{\Pi}^{(l-1)})^{\mathsf{T}}\widehat{\bm{x}}^{(l)}-\bm{x}^{(l)}\big{)}\\|_{2}$
	$\displaystyle\stackrel{{\scriptstyle(d)}}{{\leq}}~{}$	$\displaystyle\\|\bm{\Pi}^{(l-1)}\bm{W}^{(l)}(\bm{\Pi}^{(l)})^{\mathsf{T}}-\widehat{\bm{W}}^{(l)}\\|_{F}\,\\|\widehat{\bm{x}}^{(l)}\\|_{2}+\\|\bm{W}^{(l)}(\bm{\Pi}^{(l)})^{\mathsf{T}}\\|_{F}\,\\|(\bm{\Pi}^{(l-1)})^{\mathsf{T}}\widehat{\bm{x}}^{(l)}-\bm{x}^{(l)}\\|_{2}$
	$\displaystyle=~{}$	$\displaystyle\\|\widehat{\bm{W}}^{(l)}-\bm{\Pi}^{(l-1)}\bm{W}^{(l)}(\bm{\Pi}^{(l)})^{\mathsf{T}}\\|_{F}\,\\|\widehat{\bm{x}}^{(l)}\\|_{2}+\\|\bm{W}^{(l)}\\|_{F}\,\\|\widehat{\bm{x}}^{(l)}-\bm{\Pi}^{(l-1)}\bm{x}^{(l)}\\|_{2}$
	$\displaystyle\stackrel{{\scriptstyle(e)}}{{\leq}}~{}$	$\displaystyle\varepsilon\\|\widehat{\bm{x}}^{(l)}\\|_{2}+\tau\\|\bm{\Pi}^{(l-1)}\bm{x}^{(l)}-\widehat{\bm{x}}^{(l)}\\|_{2}$
	$\displaystyle\leq~{}$	$\displaystyle\varepsilon\big{(}\\|\widehat{\bm{x}}^{(l)}-\bm{\Pi}^{(l-1)}\bm{x}^{(l)}\\|_{2}+\\|\bm{\Pi}^{(l-1)}\bm{x}^{(l)}\\|_{2}\big{)}+\tau\\|\bm{\Pi}^{(l-1)}\bm{x}^{(l)}-\widehat{\bm{x}}^{(l)}\\|_{2}$
	$\displaystyle=~{}$	$\displaystyle(\tau+\varepsilon)\\|\bm{\Pi}^{(l-1)}\bm{x}^{(l)}-\widehat{\bm{x}}^{(l)}\\|_{2}+\varepsilon\\|\bm{x}^{(l)}\\|_{2}$
	$\displaystyle\leq~{}$	$\displaystyle(\tau+\varepsilon)\\|\bm{\Pi}^{(l-1)}\bm{x}^{(l)}-\widehat{\bm{x}}^{(l)}\\|_{2}+\varepsilon B$

		$\displaystyle\min_{\bm{\Pi}}\min_{\bm{X}}\\|\bm{X}-\bm{W}\\|_{F}^{2}+\eta_{1}\\|\bm{X}\\|_{0}+\eta_{2}\\|(\bm{M\Pi})\odot\bm{X}\\|_{0}$
	$\displaystyle=$	$\displaystyle\min_{\bm{\Pi}}\min_{\bm{X}}\sum_{{i,k:\,\Pi_{k,i}=1}}\\|\bm{x}_{i}-\bm{w}_{i}\\|_{2}^{2}+\eta_{1}\\|\bm{x}_{i}\\|_{0}+\eta_{2}\\|\bm{m}_{k}\odot\bm{x}_{i}\\|_{0},$
	$\displaystyle=$	$\displaystyle\min_{\bm{\Pi}}\sum_{{i,k:\,\Pi_{k,i}=1}}\min_{\bm{x}_{i}}\\|\bm{x}_{i}-\bm{w}_{i}\\|_{2}^{2}+\eta_{1}\\|\bm{x}_{i}\\|_{0}+\eta_{2}\\|\bm{m}_{k}\odot\bm{x}_{i}\\|_{0}.$