AGO: Boosting Mobile AI Inference Performance by Removing Constraints on Graph Optimization

Zhiying Xu^‡1, Hongding Peng^‡1, Wei Wang2 1Nanjing University, China {zyxu, mg20330044}@smail.nju.edu.cn 2Nanjing University, China [email protected]

Abstract

Traditional deep learning compilers rely on heuristics for subgraph generation, which impose extra constraints on graph optimization, e.g., each subgraph can only contain at most one complex operator. In this paper, we propose AGO, a framework for graph optimization with arbitrary structures to boost the inference performance of deep models by removing such constraints. To create new optimization opportunities for complicated subgraphs, we propose intensive operator fusion, which can effectively stitch multiple complex operators together for better performance. Further, we design a graph partitioning scheme that allows an arbitrary structure for each subgraph while guaranteeing the acyclic property among all generated subgraphs. Additionally, to enable efficient performance tuning on complicated subgraphs, we devise a novel divide-and-conquer tuning mechanism to orchestrate different system components. Through extensive experiments on various neural networks and mobile devices, we show that our system can improve the inference performance by up to $3.3\times$ when compared with state-of-the-art deep compilers.

^‡ Equal contributions.

I Introduction

Deep learning has become an essential building block for various mobile applications, such as machine translation and recommendation systems. The user experience of these mobile applications is critically impacted by the efficiency of running deep learning inference tasks on mobile devices. Therefore, code optimization of tensor operators, such as convolution and matrix multiplication, becomes an important research issue for mobile systems. While manual optimization can lead to order-of-magnitude reduction in inference delay [1], it typically incurs tremendous human effort, as the tuning process highly depends on the specific hardware architecture as well as the neural network structure. To relieve developers from the burden of hand-tuning, researchers have designed deep learning compilers, such as XLA [2], TVM [3], and Tiramisu [4], to perform automatic code optimization with compilation and auto-tuning techniques. In these compilers, each operator is represented as a node in a computational graph, and the tuning result for an operator is called a schedule.

The system design of deep compilers can be divided into two major layers. On the top layer, a graph frontend partitions the computational graph into multiple subgraphs. In contrast with the huge optimization space of the whole graph, optimizing each subgraph separately is more manageable. To achieve this, existing graph frontends, such as Relay [5] and Apollo [6], group adjacent operators into the same subgraph according to specific heuristics. As a result, a subgraph can contain at most one complex operator (e.g., convolution and matrix multiplication) and other simple operators (e.g., padding, add, and ReLU). The bottom layer is the tuner backend, where an enormous number of schedules are explored in the tuning space of each subgraph separately. In general, the best schedule is composed of the optimal parameters for various code optimization techniques, e.g., selected data layouts, tile sizes for loop tiling, and operator fusion schemes. Existing tuners exploit either search-based methods [7, 8, 9] or polyhedral models [10, 6] to achieve such exploration. After exploration, the compiler generates an optimized tensor program based on the best schedule.

Unfortunately, existing work cannot handle arbitrary graphs efficiently, especially when facing emerging neural models that tend to use complex structures [11, 12, 13, 14, 15, 16, 17]. First, the graph frontend heavily relies on offline hard-coded heuristics to perform graph partitioning. These heuristics only produce simple subgraph structures while overlooking other optimization opportunities. They also do not scale well to new neural networks. Second, the tuner implicitly limits itself to a small tuning space, because the space is bounded by the simple subgraphs generated by the frontend. Such strict constraints on graph optimization seriously compromise the inference performance. For example, existing frontends cannot generate subgraphs with multiple complex operators to enable intensive fusion and joint optimization. By contrast, we observe that removing this constraint can achieve up to $3.3\times$ speedup for end-to-end inference.

This paper proposes AGO, a framework that enables arbitrary structure graph optimization to boost the inference performance of mobile deep learning. In the top graph frontend, AGO exploits a new weighted clustering algorithm to perform graph partitioning; each of the generated subgraphs is free of prior constraints and may contain multiple complex operators. In the bottom backend, AGO designs a more powerful tuner, which can automatically explore schedules for any subgraph. Additionally, different from prior art, AGO incorporates an extra middle reformer layer to orchestrate the frontend and the backend for efficient subgraph optimization.

We need to address three unique challenges to achieve the arbitrary structure graph optimization.

Challenge 1: How to remove the constraints on subgraph structures while keeping the network acyclic? Allowing arbitrary subgraph structures means that any edge in the original graph can cross a cut in the partition. However, this can lead to cycles among the generated subgraphs. Cyclic dependencies will result in deadlocks when executing these subgraphs at the runtime. To address this issue, we analytically scrutinize the inter-subgraph data dependency given a directed computational graph. We then show that we can safely group operators in the affix set without generating cycles. Based on our analysis, we devise an iterative clustering algorithm achieving the acyclic property in the graph frontend.

Challenge 2: How to efficiently tune arbitrary subgraphs? The primary hurdle to optimizing a subgraph with any structure lies in the case where multiple complex operators reside in the same subgraph, which we call a complicated subgraph. Although a complicated subgraph can expand the search space to open more optimization opportunities, directly exploring schedules in such a large space will incur formidable computational costs, and hence inefficient tuning. We use two schemes to improve the tuning efficiency. First, when generating subgraphs in the graph frontend, we assign a weight to each operator by systematically modeling the relationship between the subgraph structure and the tuning complexity. Therefore, we can easily avoid unreasonably huge subgraphs by bounding the weight. Second, during tuning, we propose a divide-and-conquer mechanism in the reformer layer to handle a complicated subgraph. The reformer layer further splits a subgraph into several mini-subgraphs, each of which is small enough to be tuned efficiently. After several rounds of tuning mini-subgraphs, the reformer layer will join them back as a large subgraph for further optimization.

Challenge 3: How to create new optimization opportunities given a complicated subgraph? One of the most influential tuning techniques to optimize a subgraph is to fuse operators together so that expensive memory accesses can be reduced. However, different from the conventional operator fusion, fusing multiple complex operators in a complicated subgraph can induce redundant computation, which poses an enormous challenge for the bottom tuner. To avoid the dilemma between the redundancy of fusion and the insufficient optimization without fusion, we systematically analyze the inter-operator data dependency in complicated subgraphs. We then discover two categories of subgraph structures that can enable operator fusion while obviating re-computation, which we call intensive fusion. Therefore, we can exploit new optimization techniques when a complicated subgraph falls into one of the two categories. Additionally, when this condition is unmet, our tuner can still benefit from joint optimization for all operators in a complicated subgraph, while the tuning efficiency is already emphasized by the reformer layer.

In summary, we make the following contributions:

$\bullet$ We reveal that the strict constraints imposed on graph optimization are a major obstacle that prevents deep learning compilers from catering to emerging complicated neural architectures.

$\bullet$ We design a weighted clustering algorithm to perform graph partitioning in the frontend, which removes the constraints on subgraph structures while guaranteeing the acyclic property in the resulting partition.

$\bullet$ We devise a divide-and-conquer tuning mechanism to efficiently tune complicated subgraphs, which serves as a middle layer to orchestrate the frontend and the backend.

$\bullet$ We craft a powerful tuner in the backend, which automatically and effectively optimizes complicated subgraphs through the proposed intensive fusion technique.

We integrate AGO into an existing deep compiler and conduct extensive experiments for evaluation. The results show that AGO improves the inference performance by up to $3.3\times$ compared with state-of-the-art hand-tuned libraries and auto-tuning frameworks, e.g., Torch Mobile [18] and Ansor [9].

II System Overview

Figure 1: An illustrative computational graph.

The input for a deep compiler is a deep learning model, which can be represented as a computational graph where operators and tensors are denoted as nodes and edges respectively, as illustrated in Fig. 1. Complex operators, such as convolution and matrix multiplication operators, are represented as green nodes in Fig. 1, while the orange nodes represent simple operators, e.g., add, ReLU, and normalization. In prior deep compilers [3, 5, 9, 6], the computational graph is first partitioned by heuristics into many small subgraphs. Each subgraph in these frameworks can contain at most one complex operator. Thus, $op_{1}$ and $op_{2}$ must reside in two different subgraphs, although they share the same input tensor and can be stitched together to improve data locality. Operators $op_{3}$ and $op_{4}$ may constitute another subgraph, even though such a subgraph without complex operators is trivial and leaves no room for performance tuning. The other branch in Fig. 1 contains two complex operators, $op_{5}$ and $op_{7}$ , which are forced into two separate subgraphs although combining all three operators on this branch may benefit from intensive operator fusion. Therefore, existing heuristics generate unbalanced, small, and simple subgraph structures and hinder further optimization opportunities.

We observe that such inefficient partitioning originates from two aspects. First, the heuristics employed in the graph frontend introduce many unnecessary constraints on the subgraph structure. Second, the underlying tuner backend cannot handle complicated subgraph structures due to the over-simplification of the tuning space. Both of them contribute significantly.

To address this issue, AGO exploits a weighted clustering algorithm to perform graph partitioning which allows arbitrary subgraph structures. Moreover, AGO designs a more powerful tuner achieving intensive fusion to handle any complicated subgraphs. For instance, $op_{1}$ , $op_{2}$ , $op_{3}$ , and $op_{4}$ can be grouped together for operator fusion and joint optimization. Also, we can place $op_{5}$ , $op_{6}$ , and $op_{7}$ in the same subgraph in the frontend, and then intensively fuse them in the backend to further boost the performance.

The workflow of AGO is illustrated in Fig. 2.

1. Given a model file generated by common deep learning frameworks (e.g., TensorFlow [19]), the graph frontend first resolves it into a computational graph $G$ .

2. The frontend partitions operators in $G$ into $n$ subgraphs, each of which is denoted as $S_{i}$ , where $1\leq i\leq n$ .

3. In the reformer layer, AGO further splits each $S_{i}$ into $m_{i}$ mini-subgraphs, each of which is denoted as $M_{ij}$ , where $1\leq j\leq m_{i}$ .

4. We then offload each $M_{ij}$ as a tuning task for the tuner backend.

5. After preliminary mini-subgraph optimization, the backend provides the tuned schedules as feedback to the reformer layer.

6. Depending on the feedback, the reformer layer joins the selected mini-subgraphs $M_{ij}$ back as a large subgraph.

7. Each merged subgraph $S_{i}$ becomes a new tuning task for the tuner backend.

8. Finally, after optimizing each subgraph $S_{i}$ , we generate more efficient code based on the tuned schedules.

Next, we will elaborate our system in a bottom-up manner.

III Subgraph Optimization in Backend

In deep models, a tensor operator can be implemented as deeply-nested loops in the source code. Thus most optimization techniques can be achieved by loop transformation. For instance, we can split and reorder loops to achieve loop tiling, or merge loops of two operators for operator fusion. In practice, tiling and fusion are two of the most influential techniques for performance optimization. In particular, operator fusion works on multiple operators and hence depends on subgraph structures. To optimize arbitrary subgraphs, our major solution in the tuner backend is to craft a new operator fusion scheme, named intensive operator fusion.

In this section, we first introduce how conventional operator fusion works in a mini-subgraph $M_{ij}$ . Then, we will depict the intensive fusion and how to exploit it for a subgraph $S_{i}$ .

III-A Conventional Operator Fusion for Mini-subgraph

Split from a subgraph by the later reformer layer, each mini-subgraph $M_{ij}$ contains at most one complex operator. Suppose $M_{ij}$ contains three operators: 2-d convolution, bias addition, and ReLU, which is a typical workload for recent convolutional neural networks. In the following, we will use this mini-subgraph as an example to illustrate how operator fusion improves its computational efficiency.

for n in range(N):
  for o in range(O):
    for h in range(H):
      for w in range(W):
        Conv[n,o,h,w] = 0.0
        for ri, rr, rc in range(I, R, C):
          Conv[n,o,h,w] += Inp[n,ri,h+rr,w+rc] * Weight[o,ri,rr,rc]
for n, o, h, w in range(N, O, H, W):
  Sum[…] = Conv[…] + Bias[o]
for n, o, h, w in range(N, O, H, W):
  ReLU[…] = max(Sum[…], 0.0)

Figure 3: Loop nest of a mini-subgraph without fusion.

We present the initial loop nest of $M_{ij}$ in Fig. 3, where $N,O,H,W$ represent the batch size, the number of output channels, the height, and the width of the output tensor, respectively. (Footnote 1: Direct convolution is preferable to a matrix-multiplication-based implementation [20]; the latter is often used when direct library support is lacking.) $I$ is the number of input channels, and $R,C$ are the height and the width of the convolutional window. The three reduction loops $I,R,C$ are written in one line for simplicity. In this program, the bias addition is executed after the whole convolution. When the Conv tensor is large, most of its elements will have been evicted from the cache by the time of the bias addition. Subsequently, we must fetch these elements from the main memory into cache again when performing the addition. These additional cache misses lead to poor performance, especially on mobile devices with small caches.

for n, o, h, w in range(N, O, H, W):
  Conv[n, o, h, w] = 0.0
  for ri, rr, rc in range(I, R, C):
    Conv[…] += Inp[…] * Weight[…]
  Sum[…] = Conv[…] + Bias[o]
  ReLU[…] = max(Sum[…], 0.0)

Figure 4: Conventional fusion within a mini-subgraph.

We can perform operator fusion within $M_{ij}$ to address this issue, as illustrated in Fig. 4. Once each element of the Conv tensor is calculated, the following addition and ReLU operations are performed immediately. Thus, data elements are consumed by downstream operators while still in cache, which improves operational intensity and inter-operator data locality.
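The equivalence between the unfused loop nest of Fig. 3 and the fused loop nest of Fig. 4 can be checked with a small runnable sketch. The helper names (`make_tensor`, `conv_at`, `unfused`, `fused`) and the nested-list tensor representation are illustrative assumptions, not part of AGO; the sketch only demonstrates that fusion changes the order of work, not the result.

```python
import itertools
import random

def make_tensor(shape, rng):
    """Build a nested list of the given shape filled with random floats."""
    if len(shape) == 1:
        return [rng.uniform(-1.0, 1.0) for _ in range(shape[0])]
    return [make_tensor(shape[1:], rng) for _ in range(shape[0])]

def conv_at(inp, weight, n, o, h, w, I, R, C):
    """Reduce over input channels and the R x C window for one output element."""
    acc = 0.0
    for ri, rr, rc in itertools.product(range(I), range(R), range(C)):
        acc += inp[n][ri][h + rr][w + rc] * weight[o][ri][rr][rc]
    return acc

def unfused(inp, weight, bias, N, O, H, W, I, R, C):
    """Three separate passes: the whole Conv tensor is materialized first."""
    idx = list(itertools.product(range(N), range(O), range(H), range(W)))
    conv = {i: conv_at(inp, weight, *i, I, R, C) for i in idx}
    summed = {i: conv[i] + bias[i[1]] for i in idx}
    return {i: max(summed[i], 0.0) for i in idx}

def fused(inp, weight, bias, N, O, H, W, I, R, C):
    """One pass: bias addition and ReLU run right after each Conv element."""
    out = {}
    for n, o, h, w in itertools.product(range(N), range(O), range(H), range(W)):
        acc = conv_at(inp, weight, n, o, h, w, I, R, C)
        out[n, o, h, w] = max(acc + bias[o], 0.0)
    return out

# Tiny illustrative shapes (assumed, not taken from the paper).
N, O, H, W, I, R, C = 1, 2, 3, 3, 2, 3, 3
rng = random.Random(0)
inp = make_tensor((N, I, H + R - 1, W + C - 1), rng)
weight = make_tensor((O, I, R, C), rng)
bias = make_tensor((O,), rng)
same = unfused(inp, weight, bias, N, O, H, W, I, R, C) == fused(inp, weight, bias, N, O, H, W, I, R, C)
```

Both versions perform the same arithmetic per element; the fused version simply avoids re-reading the intermediate Conv and Sum tensors, which is where the cache benefit described above comes from.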

III-B Intensive Operator Fusion for Subgraph

Conventional operator fusion only stitches a complex operator with the simple operators that follow it, and is hence also called epilogue fusion. To optimize subgraphs with complicated structures, we propose intensive fusion, which allows fusing multiple complex operators together. Intensive fusion can further improve the data locality between complex operators without the strict constraint on the subgraph structure imposed by prior schemes, i.e., that only one complex operator is allowed.

Unfortunately, the profits from intensive fusion are not free. Since a complex operator involves reduction, the fusion-after-tiling optimization path often induces redundant computation.

In this subsection, we first systematically analyze the inter-operator data dependency to characterize redundant computation. Based on the analysis, we will depict how to achieve intensive fusion efficiently by identifying two categories of subgraph structures without redundancy.

III-B1 Why Re-computation Happens

Suppose we have two 2-d convolution operators in a subgraph $S_{i}$ , and now we try to fuse them.

for n, o2, h, ow in range(N, O2, H2, W2 // 16):
  # intra-tile loops, compute a tile of Conv1
  for o1, ih1, iw1 in range(O1, 1 + R2 - 1, 16 + C2 - 1):
    Conv1[n, o1, h+ih1, ow*16+iw1] = 0.0
    for ri, rr1, rc1 in range(I, R1, C1): # reduction
      Conv1[n, o1, h+ih1, ow*16+iw1] += …
  # intra-tile loops, compute a tile of Conv2
  for io2, ih2, iw2 in range(1, 1, 16):
    Conv2[…, ow*16+iw2] = 0.0
    for ro, rr2, rc2 in range(O1, R2, C2): # reduction
      Conv2[…] += Conv1[…, ow*16+iw2+rc2] * Weight2[…]

Figure 5: Intensive fusion program of two convolutions.

In Fig. 5, we use $(N,O_{1},H_{1},W_{1})$ and $(N,O_{2},H_{2},W_{2})$ to represent the shapes of the upstream Conv1 tensor and the downstream Conv2 tensor, respectively. In this case, we also have $H_{1}=H_{2}+(R_{2}-1)$ and $W_{1}=W_{2}+(C_{2}-1)$ according to the convolution algorithm. For simplicity, assume the tiling of Conv2 is $1\times 1\times 16$ on $O_{2}\times H_{2}\times W_{2}$ dimensions ( $W_{2}>16$ ). A tile is a fraction of some tensor. Namely, the intra-tile loops {io2, ih2, iw2} only compute a vector of length $16$ of Conv2. According to the convolution algorithm, this vector-like tile requires a $O_{1}\times\left(1+(R_{2}-1)\right)\times\left(16+(C_{2}-1)\right)$ tile of Conv1 for computation, which is provided by the {o1, ih1, iw1} loops. However, the reduction loops of the upstream convolution (i.e., the ri, rr1, rc1 loops) will be executed $N\times O_{2}\times H_{2}\times\dfrac{W_{2}}{16}\times O_{1}\times R_{2}\times(15+C_{2})$ times in total, which is much larger than the non-fusion case $N\times O_{1}\times(H_{2}+R_{2}-1)\times(W_{2}+C_{2}-1)$ .
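To make the gap between the two counts concrete, the following sketch evaluates both expressions from the paragraph above and cross-checks the fused count by enumerating the outer and intra-tile loops of Fig. 5. The function names and the example shapes are assumptions chosen for illustration; the tile width of 16 follows the text.

```python
import itertools

def unfused_upstream_reductions(N, O1, H2, W2, R2, C2):
    """Times the Conv1 reduction nest runs without fusion: N * O1 * H1 * W1."""
    H1, W1 = H2 + R2 - 1, W2 + C2 - 1
    return N * O1 * H1 * W1

def fused_upstream_reductions(N, O1, O2, H2, W2, R2, C2, tile_w=16):
    """Times the Conv1 reduction nest runs under the fusion of Fig. 5."""
    assert W2 % tile_w == 0, "sketch assumes W2 is a multiple of the tile width"
    outer = N * O2 * H2 * (W2 // tile_w)      # loops {n, o2, h, ow}
    tile = O1 * R2 * (tile_w + C2 - 1)        # loops {o1, ih1, iw1}
    return outer * tile

def count_fused_by_enumeration(N, O1, O2, H2, W2, R2, C2, tile_w=16):
    """Walk the same loop structure and count entries into the reduction nest."""
    count = 0
    outer = itertools.product(range(N), range(O2), range(H2), range(W2 // tile_w))
    for _ in outer:
        for _ in itertools.product(range(O1), range(R2), range(tile_w + C2 - 1)):
            count += 1
    return count

# Hypothetical layer shapes, not taken from the paper's experiments.
shape = dict(N=1, O1=32, O2=64, H2=32, W2=32, R2=3, C2=3)
fused_count = fused_upstream_reductions(**shape)
unfused_count = unfused_upstream_reductions(
    shape["N"], shape["O1"], shape["H2"], shape["W2"], shape["R2"], shape["C2"])
redundancy = fused_count / unfused_count
```

For these assumed shapes the fused program enters the upstream reduction nest well over a hundred times more often than the unfused one, which illustrates why naive fusion of two general convolutions is unattractive.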

We here formally depict the redundancy issue. We denote the global iteration space spanned by the loops of the upstream operator and the downstream operator as $GS_{1}$ and $GS_{2}$ . Then the amount of computation can be calculated as $|GS_{1}|$ and $|GS_{2}|$ respectively. After loop tiling, we denote the iteration space spanned by the inner intra-tile loops as $TS_{1}$ and $TS_{2}$ . Then the amount of computation is $\left|\frac{GS_{1}}{TS_{1}}\right|\cdot|TS_{1}|$ and $\left|\frac{GS_{2}}{TS_{2}}\right|\cdot|TS_{2}|$ respectively. Here we use $\frac{(\cdot)}{(\cdot)}$ as the inverse operator of Cartesian product $\times$ . Next, after loop fusion, the intra-tile loops for $TS_{1}$ of the upstream operator will be attached to the loop structure $\frac{GS_{2}}{TS_{2}}$ of the downstream operator. Thus the iteration space size of the upstream operator can be derived as $\left|\frac{GS_{2}}{TS_{2}}\times\left(\frac{GS_{1}}{TS_{1}}-\frac{GS_{2}}{TS_{2}}\right)\right|\cdot|TS_{1}|$ . This formula is larger than $|GS_{1}|$ (i.e., redundancy) in two cases: 1) $\frac{GS_{2}}{TS_{2}}-\frac{GS_{1}}{TS_{1}}\neq\emptyset$ (i.e., $\frac{GS_{2}}{TS_{2}}$ contains a loop that is not needed by $\frac{GS_{1}}{TS_{1}}$ ); 2) $|TS_{2}|<|TS_{1}|$ , which is only determined by the data mapping function of the downstream operator.

For Fig. 5, the outermost iteration space $\frac{GS_{2}}{TS_{2}}$ is spanned by loops {n, o2, h, ow}. The middle part $\frac{GS_{1}}{TS_{1}}-\frac{GS_{2}}{TS_{2}}$ is an empty set, where $\frac{GS_{1}}{TS_{1}}$ consists of {n, h, ow}. The innermost $TS_{1}$ involves loops {o1, ih1, iw1}. The first source of redundancy is the loop o2 $\in\left(\frac{GS_{2}}{TS_{2}}-\frac{GS_{1}}{TS_{1}}\right)$ : Conv1 is recomputed $O_{2}$ times, since a tile of Conv1 is reused by $O_{2}$ channels/tiles of Conv2. Moreover, Conv1 is recomputed $\frac{H_{2}W_{2}\times R_{2}\times(15+C_{2})}{H_{1}W_{1}\times 1\times 16}$ times on the $H_{2}\times W_{2}$ dimensions, because sliding-window operations in convolution overlap on $H_{1}$ and $W_{1}$ of Conv1 such that $|TS_{2}|<|TS_{1}|$ , and an overlapping region is reused by multiple tiles of Conv2.

We further illustrate the above issue in Fig. 6, where the yellow circle represents the output element $e$ of the upstream operator, while the blue circles denote the output elements $\{d_{1},d_{2},d_{3}\}$ of the downstream operator. Suppose the three blue elements reside in three separate data tiles after loop tiling. Then, with fusion, the yellow circle will be duplicated and stitched into three blue tiles, so it is recomputed three times. For the example in Fig. 5, $\{d_{1},d_{2},d_{3}\}$ can represent the $O_{2}$ output channels.

When $\{d_{1},d_{2},...,d_{k}\}$ are distributed in two or more data tiles, $e$ may be re-computed for each tile, thus leading to computation redundancy.

III-B2 Removing the Re-computation

As analyzed above, re-computation occurs only when two conditions hold: 1) the output tensor of the upstream operator is reused; 2) the downstream operator has two or more output data tiles. The first condition cannot be removed, since it is determined only by the operator definition (e.g., convolution algorithm). By contrast, we can break the second condition by computing the downstream complex operator without loop tiling on the reused dimensions. Consequently, as the rightmost side in Fig. 6 illustrates, the three blue elements reside in the same tile, and the yellow circle has only one outgoing edge.

For common convolutions, the input tensor (i.e., the output tensor of the upstream operator) is reused in three dimensions $O_{2},H_{2},W_{2}$ , as in Fig. 5. Without tiling, the original output data size of $O_{2}\times H_{2}\times W_{2}$ will normally be larger than the cache capacity, which results in poor cache utilization during execution. Fortunately, there are widely used convolution operators on mobile devices without this concern, namely, depthwise convolution and pointwise convolution. The former does not perform reduction on the input channel dimension, while the latter is free of reduction over the kernel window (i.e., $R_{2}=C_{2}=1$ ). Specifically, the input tensor is reused only on the $H_{2},W_{2}$ dimensions in depthwise convolution, and only on the $O_{2}$ dimension in pointwise convolution.

We achieve intensive fusion for these two categories in Fig. 7. We use uppercase letters to denote the original dimensions, and the corresponding lowercase letters to denote the tiled dimensions. When the downstream convolution is depthwise, as shown in Fig. 7(a), its data tile should be $H_{2}W_{2}o_{2}$ , where the $H_{2},W_{2}$ dimensions are not tiled due to reuse and $o_{2}$ is a tiled dimension from $O_{2}$ . Then the data tile of the upstream convolution should be $H_{1}W_{1}o_{1}$ . In this case, we also have $o_{1}=o_{2}$ , since the number of input channels is the same as the number of output channels in the downstream depthwise convolution. Similarly, the tile of the downstream pointwise convolution can be denoted as $h_{2}w_{2}O_{2}$ in Fig. 7(b). Both tiles are much smaller than $O_{2}\times H_{2}\times W_{2}$ . Notably in Fig. 7, the inter-operator data mapping is determined by the convolution algorithm. For example, in Fig. 7(b), the Conv2 tile $h_{2}w_{2}O_{2}$ requires a Conv1 tile $h_{1}w_{1}o_{1}$ and a Weight2 tile $o_{1}R_{2}C_{2}O_{2}$ for computation, where $h_{1}=h_{2}$ , $w_{1}=w_{2}$ , and $R_{2}=C_{2}=1$ . Additionally, no constraints are imposed on the inner-level tiling of the two convolutions, indicated as the red dashes in Fig. 7. In other words, if intensive fusion requires that two convolution tiles reside in L1 cache for locality, their own register-level tiling is not affected.

Putting all the analysis together, our intensive fusion creates new optimization opportunities for complicated subgraphs when the downstream convolution is depthwise or pointwise. We do not include the analysis of matrix multiplication because it is mathematically equivalent to pointwise convolution. Even if the downstream convolution is neither depthwise nor pointwise, our tuner can still benefit from a larger tuning space given a complicated subgraph. Meanwhile, the tuning efficiency for such complicated subgraphs will be addressed by the later reformer layer. Therefore, our tuner eschews the need for prior constraints on subgraph structures in favor of intensive fusion and joint optimization.
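The condition summarized above can be expressed as a small predicate on the downstream convolution. This is a minimal sketch under stated assumptions: the `Conv2d` record, its field names, and the rule that a depthwise convolution has `groups == in_channels == out_channels` are illustrative choices, not AGO's actual data structures, and stride is assumed to be 1.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Conv2d:
    """Hypothetical descriptor of a downstream convolution (stride 1 assumed)."""
    in_channels: int
    out_channels: int
    kernel_h: int
    kernel_w: int
    groups: int = 1

def is_depthwise(conv):
    # Assumption: depthwise means one filter per channel, so there is no
    # reduction over the input channel dimension.
    return conv.groups == conv.in_channels == conv.out_channels

def is_pointwise(conv):
    # Pointwise means R2 = C2 = 1, so there is no sliding-window overlap.
    return conv.kernel_h == 1 and conv.kernel_w == 1

def reused_dims(conv):
    """Output dimensions along which the upstream tensor is reused."""
    dims = set()
    if not is_depthwise(conv):
        dims.add("O2")            # every output channel reads all input channels
    if not is_pointwise(conv):
        dims.update({"H2", "W2"})  # neighbouring windows overlap
    return dims

def allows_intensive_fusion(conv):
    """True for the two categories named in the text: depthwise or pointwise."""
    return is_depthwise(conv) or is_pointwise(conv)
```

The reused dimensions are the ones that must stay untiled in the downstream tile, so a depthwise layer keeps $H_{2},W_{2}$ whole and tiles $O_{2}$, while a pointwise layer keeps $O_{2}$ whole and tiles $H_{2},W_{2}$.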

IV Graph Partitioning in Frontend

Given the original directed computational graph $G$ , the graph frontend partitions $G$ into many smaller subgraphs, each of which can be optimized separately and efficiently. Here, we refer to the graph partition as a collective term. It is formally defined as a set of subgraphs $S_{1},S_{2},...,S_{n}$ , such that the node sets of the subgraphs are disjoint and each node in $G$ belongs to exactly one subgraph. Based on our powerful tuner, we allow an arbitrary structure for each $S_{i}$ . This further indicates that any edge in $G$ can potentially cross the cut in the partition. However, this can lead to 1) search space explosion and 2) cycles in the resulting graph partition. A larger search space increases the optimization complexity and thus decreases the tuning efficiency of the tuner backend. To address this issue, our solution involves two aspects. First, in the graph frontend, our partitioning algorithm assigns a weight to each operator and controls the weight of each subgraph, hence avoiding unreasonably huge subgraphs. Second, in the later reformer layer, we design a divide-and-conquer tuning mechanism to reduce the tuning complexity for each subgraph. The other issue is the cyclic dependency among subgraphs, which may disable some critical optimization techniques, e.g., data layout selection, and lead to deadlocks during runtime execution. To generate a cycle-free partition, we devise an iterative clustering algorithm, which theoretically guarantees the acyclic property of the graph partitioning.

IV-A Weight Assignment for Operators

In a computational graph $G$ , we denote the set of nodes as $V$ and the set of directed edges as $E$ . Each node in the graph, denoted as $v$ , represents an operator in the model, and each of the directed edges, denoted as $e$ , represents a tensor produced by the source operator and consumed by the destination operator.

To avoid an unreasonably huge subgraph $S_{i}$ , we resort to measuring the tuning complexity of $S_{i}$ directly during partitioning. Previous works use indirect metrics as weights [5], e.g., the number of operators in a subgraph $S_{i}$ , which we find ineffective. Specifically, we observe that the contributions of operators in $S_{i}$ towards the tuning complexity are not the same and highly depend on their tensor shapes and operator types. We here conduct an experiment to study the relationship between the subgraph structure and the tuning complexity. We use tuning budget, which is the total number of explored schedules to obtain stable performance for a subgraph, as an indicator of the tuning complexity. We then tune different subgraphs, each of which contains different operators, and record their tuning budgets.

We report the results in Fig. 8, where the budget is on a scale of 100, the batch size is 1, the padding size in convolution is 1, the height/width of the convolutional window is 3, and the numbers behind $IOHW$ are the sizes of the corresponding dimensions. For example, in the second subgraph ( $Conv+Add$ ), the input tensor has shape ( $N$ =1, $I$ =32, $H$ =28, $W$ =28), and the output tensor of the $Conv$ operator has shape ( $N$ =1, $O$ =64, $H$ =28, $W$ =28). Based on Fig. 8, we have two observations. First, for each subgraph structure, the tuning budget does not scale directly with the number of operators, but shows a linear trend with tensor shapes. Second, for a given tensor shape, the tuning budget scales almost linearly with the number of operators, although the tuning space size increases exponentially with the number of operators.

The root cause of the first observation is that the dominant optimization technique during tuning is loop transformation, e.g., loop tiling and fusion in Section III-B. Thus the tuning complexity is directly determined by the loop nest in the program for the operator. This involves two factors: 1) the number of loops (e.g., seven nested loops in a 2-d convolution); 2) the extent of each loop. Thus, we define a fine-grained weight for each operator as follows, which measures the tuning complexity as a linear function of the loop nest:

$$w_{v}=c\times\prod_{l\in L_{v}}{\log(s_{l})}+b\,,\qquad(1)$$

where $L_{v}$ is the set of loops for the operator $v$ , $s_{l}$ is the extent of the loop $l\in L_{v}$ , $c$ is the slope, and $b$ is the bias. With Eq. (1), it is easy to see that a larger weight indicates higher tuning complexity. Then, according to the second observation, the weight of a subgraph $S_{i}$ can be derived as the sum of the weights of all operators in $S_{i}$ . As illustrated by the black dashed line in Fig. 8, we can almost perfectly fit the tuning budget with Eq. (1). Subsequently, we are able to guarantee a tractable size for each subgraph by setting a threshold as the maximum weight. Moreover, such a design helps eliminate trivial subgraphs that may waste tuning budget and yields balance among all subgraphs.
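As a worked illustration of Eq. (1) and the additive subgraph weight, the sketch below computes both and applies a maximum-weight threshold. The values of `c`, `b`, and the threshold, the use of the natural logarithm, and the choice to skip loops of extent 1 (whose logarithm is zero and would otherwise cancel the whole product) are all assumptions made for this example; the text does not specify them.

```python
import math

def operator_weight(loop_extents, c=1.0, b=0.0):
    """Eq. (1): w_v = c * prod(log(s_l)) + b over the loops of operator v.

    Assumption: loops with extent 1 are skipped, since log(1) = 0 would
    zero the product (e.g., a batch loop with N = 1).
    """
    product = 1.0
    for extent in loop_extents:
        if extent > 1:
            product *= math.log(extent)
    return c * product + b

def subgraph_weight(operators, c=1.0, b=0.0):
    """Weight of a subgraph: the sum of the weights of its operators."""
    return sum(operator_weight(extents, c, b) for extents in operators)

def within_threshold(operators, max_weight, c=1.0, b=0.0):
    """True if the subgraph stays below the (hypothetical) weight threshold."""
    return subgraph_weight(operators, c, b) <= max_weight

# Loop extents (N, O, H, W, I, R, C) of the Conv in the Conv+Add example,
# followed by the four loops (N, O, H, W) of the element-wise Add.
conv_loops = [1, 64, 28, 28, 32, 3, 3]
add_loops = [1, 64, 28, 28]
w_conv = operator_weight(conv_loops)
w_sub = subgraph_weight([conv_loops, add_loops])
```

In a partitioner, a check such as `within_threshold` would be evaluated before merging a node into a cluster, so that no subgraph grows past the weight budget.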

IV-B Acyclic Partitioning

After calculating weights, we propose a new algorithm to address the issue of cyclic dependency. When arbitrary subgraph structures are allowed, the graph frontend can unexpectedly introduce cycles in the resulting partition. For example, suppose $Conv1$ and $Conv3$ in Fig. 9 can trigger the intensive fusion. Then we put them in the same subgraph $S_{1}$ , while $Conv2$ constitutes another subgraph $S_{2}$ . In this case, $S_{1}$ and $S_{2}$ depend on each other through their input and output tensors. Such cycles can lead to deadlocks when executing these subgraphs at runtime. Although prior works employ heuristics to produce subgraphs without cycles [5, 6], the generated subgraphs are oversimplified and many opportunities are thus excluded. For example, the three complex operators in Fig. 9 will be placed into three separate subgraphs [5, 6], hence missing opportunities for intensive fusion and joint optimization.

We first formally define the acyclic property for a graph partition. We call a partition without cycles an $n$ -way acyclic partition, if it satisfies the following property.

Definition 1.

$n$ -way acyclic partition: An $n$ -way acyclic partition contains $n$ disjoint sets of nodes $\{V_{1},V_{2},...,V_{n}\}$ . For any $u,u^{\prime}\in V_{i}$ , $v,v^{\prime}\in V_{j}$ , $1\leq i\neq j\leq n$ , there cannot exist two paths, from $u$ to $v$ and from $v^{\prime}$ to $u^{\prime}$ , at the same time.
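Definition 1 can be checked directly by testing mutual reachability between every pair of node sets. The sketch below is one straightforward way to do so, not the clustering algorithm of this paper; the function names and the three-node chain used in the example (a simplified reading of the $Conv1$, $Conv2$, $Conv3$ situation) are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def reachable_from(adj, sources):
    """All nodes reachable from any node in `sources` (sources included)."""
    seen = set(sources)
    stack = list(sources)
    while stack:
        u = stack.pop()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def is_acyclic_partition(edges, parts):
    """Check Definition 1: no pair (V_i, V_j) may reach each other."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    parts = [set(p) for p in parts]
    reach = [reachable_from(adj, p) for p in parts]
    for i, j in combinations(range(len(parts)), 2):
        i_reaches_j = bool(reach[i] & parts[j])
        j_reaches_i = bool(reach[j] & parts[i])
        if i_reaches_j and j_reaches_i:
            return False
    return True

# Assumed dependency chain: Conv1 -> Conv2 -> Conv3.
edges = [("Conv1", "Conv2"), ("Conv2", "Conv3")]
bad = is_acyclic_partition(edges, [{"Conv1", "Conv3"}, {"Conv2"}])
good = is_acyclic_partition(edges, [{"Conv1", "Conv2"}, {"Conv3"}])
```

Grouping the two ends of the chain while leaving the middle node outside makes the two subgraphs depend on each other, which is exactly the situation the definition rules out.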

Further, we introduce the concept of topological stage as an identifier to the topological position of each node in $G$ .

Definition 2.

topological stage: The topological stage $ts_{v}\geq 1$ is an integer denoting the position of $v$ in $G$ . It can be calculated as the length of the longest path from the root $r$ (a node with zero in-degree) to the current $v$ .

It is easy to see, for any node $v\in V$ , $\forall e=(u,v)\in E$ , we have $ts_{u}<ts_{v}$ ; $\forall e=(v,w)\in E$ , we have $ts_{w}>ts_{v}$ .
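As an illustration, topological stages can be computed with one sweep in topological order. The sketch below is not taken from the AGO implementation; it assumes that roots receive stage 1 (consistent with $ts_{v}\geq 1$), and the function name `topological_stages` and the three-node example graph are hypothetical.

```python
from collections import defaultdict

def topological_stages(nodes, edges):
    """Map each node to its topological stage; roots (zero in-degree) get stage 1."""
    succ = defaultdict(list)
    indeg = {v: 0 for v in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    stage = {v: 1 for v in nodes}
    ready = [v for v in nodes if indeg[v] == 0]
    while ready:
        u = ready.pop()
        for v in succ[u]:
            # A node sits one stage past its deepest predecessor.
            stage[v] = max(stage[v], stage[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return stage

nodes = ["Conv1", "Conv2", "Conv3"]
edges = [("Conv1", "Conv2"), ("Conv2", "Conv3"), ("Conv1", "Conv3")]
ts = topological_stages(nodes, edges)
print(ts)  # {'Conv1': 1, 'Conv2': 2, 'Conv3': 3}
```

In this example every edge $(u,v)$ satisfies $ts_{u}<ts_{v}$, matching the observation above.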

Based on the concept of topological stage, we observe that for each node $v$, there exists a set of nodes that can be safely grouped with it. We call such a set the affix set of $v$.

Definition 3.

affix set: We first denote the set of nodes that $v$ can connect to in the underlying undirected graph corresponding to $G$ as $UC_{v}$ . Then, the affix set $AS_{v}$ for node $v$ is a subset of $UC_{v}$ , such that each node in ${AS}_{v}$ satisfies one of the following two conditions:

		$\displaystyle\forall u\in AS_{v},\; ts_{u}=ts_{v}+1;$
		$\displaystyle\forall u\in AS_{v},\; ts_{u}=ts_{v}-1.$
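A small Python sketch may make Definition 3 concrete. It rests on one interpretive assumption: $UC_{v}$ is read here as the direct neighbours of $v$ once edge directions are ignored. The helper name `affix_set`, the extra edge $Conv1\rightarrow Conv3$, and the hard-coded stages are illustrative only.

```python
def affix_set(v, edges, stage):
    """Neighbours of v (edge direction ignored) whose stage differs from v's by exactly one."""
    neighbours = {b for a, b in edges if a == v} | {a for a, b in edges if b == v}
    return {u for u in neighbours if abs(stage[u] - stage[v]) == 1}

# Hypothetical graph: a chain plus a skip edge from Conv1 to Conv3.
edges = [("Conv1", "Conv2"), ("Conv2", "Conv3"), ("Conv1", "Conv3")]
stage = {"Conv1": 1, "Conv2": 2, "Conv3": 3}

print(sorted(affix_set("Conv1", edges, stage)))  # ['Conv2']
print(sorted(affix_set("Conv2", edges, stage)))  # ['Conv1', 'Conv3']
```

Although $Conv3$ is adjacent to $Conv1$, it is excluded from $AS_{Conv1}$ because their stages differ by two, which is exactly the pair whose grouping would create a cycle through $Conv2$.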

With the above definitions, we can derive the following theorem, which is the core of our later graph partitioning algorithm to guarantee the acyclic property.

Theorem 1.

Given a node $v$ and its affix set ${AS}_{v}$ , there will exist no cycles in the resulting graph partition if $v$ and any nodes in ${AS}_{v}$ cluster together to produce a new subgraph.

Proof.

We will prove the theorem by deducing a contradiction. Specifically, for any two nodes $u$ and $v$ in $G$ , assume that grouping $u$ and $v$ together will produce a cycle in the partition. Since the original graph $G$ is acyclic, the generated cycle indicates that there must exist a path $u\rightarrow p\rightarrow v$ in $G$ , where $p$ is another node. For example, in Fig. 9, the $Conv1,Conv2,Conv3$ operators are $u,p,v$ , respectively. However, we only group $u$ and $v$ if $u\in{AS}_{v}$ or $v\in{AS}_{u}$ . Thus, we have $|ts_{v}-ts_{u}|=1$ , which means the path from $u$ to $v$ or $v$ to $u$ must have no other nodes, i.e., $p$ does not exist. In summary, no cycle will be generated if $v$ and a node $u\in{AS}_{v}$ cluster together to generate a new subgraph. ∎

Algorithm 1 Graph Partition

1: function Cluster($G$)
2:   Initialize $Td,Cand,TopStage$
3:   Calculate weight for each $v\in G$
4:   while $Cand\neq\emptyset$
5:     Choose $v\in Cand$ with heaviest weight
6:     if $\exists u\in AS_{v},\textit{s.t.},w_{v}+w_{u}<Td$ then
7:       $u$ and $v$ cluster together as a hyper node $v^{\prime}$
8:       Move $v^{\prime}$ to $Cand$
9:     else
10:       Remove $v$ from $Cand$
11:     end if
12:     Update $E,TopStage$
13:   end while
14:   Translate hyper nodes into subgraphs $\{S_{1},S_{2},...\}$
15:   return $\{S_{1},S_{2},...\}$
16: end function

Based on Theorem 1, we can derive our cycle-free graph partitioning algorithm, in which affix operators iteratively cluster together to yield subgraphs. We illustrate the Cluster algorithm in Algorithm 1. At line 2, we pre-process the graph $G$ to initialize some data structures, including the maximum weight threshold $Td$ for subgraphs, the initial candidate node set $Cand$ that contains all nodes in $G$, and the information on topological stages $TopStage$. Then we calculate the weight of each operator at line 3. Next, we group affix nodes to generate subgraphs, and the weight of each subgraph is controlled via a greedy strategy (lines 4 - 13). In each iteration, the node $v$ with the heaviest weight in $Cand$ is selected (line 5); a subgraph $S_{i}$ is viewed as a hyper node here. We then search $AS_{v}$ for the node $u$ with the smallest weight. If the sum of the weights of $v$ and $u$ is smaller than the threshold $Td$, a new subgraph is produced. This subgraph is viewed as a new hyper node $v^{\prime}$ and put into the candidate set for further clustering (lines 7 - 8). Otherwise, the node $v$ is skipped and removed from $Cand$ (line 10). After each iteration, we update the edge set $E$ and the position information $TopStage$ (line 12). The algorithm repeats until the candidate set is empty. In this way, the resulting partition is guaranteed to be acyclic and each subgraph has a reasonable weight that is smaller than $Td$.
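The greedy loop can be sketched in a few lines of Python. This is a simplified illustration rather than the AGO implementation (which is written in C++ on top of TVM), and it makes several assumptions of its own: node names are strings without `+`, weights are supplied by the caller instead of being computed from Eq. 1, the affix set is taken over direct neighbours in the contracted graph, topological stages are recomputed from scratch after every merge, and ties are broken by name.

```python
from collections import defaultdict

def top_stages(nodes, edges):
    """Longest-path stage of every node, with roots at stage 1."""
    succ, indeg = defaultdict(set), {v: 0 for v in nodes}
    for u, v in edges:
        if v not in succ[u]:
            succ[u].add(v)
            indeg[v] += 1
    stage = {v: 1 for v in nodes}
    ready = [v for v in nodes if indeg[v] == 0]
    while ready:
        u = ready.pop()
        for v in succ[u]:
            stage[v] = max(stage[v], stage[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return stage

def cluster(nodes, edges, weight, td):
    """Greedy affix clustering; returns a list of sets of original node names."""
    members = {v: {v} for v in nodes}   # hyper node name -> original operators
    w = dict(weight)                    # line 3: weights are an input in this sketch
    E = set(edges)
    cand = set(nodes)                   # line 2: every node starts as a candidate
    while cand:                         # lines 4-13
        stage = top_stages(list(members), E)           # line 12: refresh TopStage
        v = max(cand, key=lambda x: (w[x], x))         # line 5: heaviest candidate
        neigh = {b for a, b in E if a == v} | {a for a, b in E if b == v}
        affix = [u for u in neigh if abs(stage[u] - stage[v]) == 1]
        u = min(affix, key=lambda x: (w[x], x), default=None)
        if u is not None and w[v] + w[u] < td:         # line 6
            new = "+".join(sorted(members[u] | members[v]))
            members[new] = members.pop(u) | members.pop(v)   # line 7: hyper node
            w[new] = w.pop(u) + w.pop(v)
            cand -= {u, v}
            cand.add(new)                              # line 8
            rename = lambda x: new if x in (u, v) else x
            E = {(rename(a), rename(b)) for a, b in E
                 if rename(a) != rename(b)}            # line 12: contract edges
        else:
            cand.remove(v)                             # line 10
    return list(members.values())                      # lines 14-15

# Hypothetical four-operator chain with unit weights.
parts = cluster(["a", "b", "c", "d"],
                [("a", "b"), ("b", "c"), ("c", "d")],
                {"a": 1, "b": 1, "c": 1, "d": 1}, td=3)
print(sorted(sorted(p) for p in parts))  # [['a', 'b'], ['c', 'd']]
```

With `td=3`, two unit-weight operators can merge (weight 2), but a third cannot join because the sum would reach the threshold, so the chain splits into two balanced subgraphs.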

V Divide-and-Conquer Tuning in Reformer Layer

The search space of a subgraph with a complicated structure is still large even if we limit its weight during graph partitioning. To achieve efficient tuning, we insert a reformer layer between the graph frontend and the tuner backend to orchestrate different components. The reformer layer exploits a divide-and-conquer mechanism to break down the complicated subgraph tuning into smaller sub-tasks and then combine them. As the dividing stage, we design a Split function. By invoking Cluster in Algorithm 1, the Split function further splits each large subgraph $S_{i}$ into several mini-subgraphs $\{M_{i1},M_{i2},...\}$ . Each mini-subgraph has at most one complex operator and a smaller weight. Then, as the conquering stage, we devise a Join function. It can combine those mini-subgraphs $\{M_{i1},M_{i2},...\}$ back to $S_{i}$ for further tuning.

The reformer layer executes the Split function immediately after graph partitioning. Then, by inspecting the feedback from the tuner backend, the reformer layer calls the Join function to combine the mini-subgraphs $M_{ij}$ back into $S_{i}$ once the tuning for each $M_{ij}(1\leq j\leq m_{i})$ tends to stabilize. During joining, the schedules found by the tuner for the mini-subgraphs are also composed into one large schedule for $S_{i}$. When $S_{i}$ is delivered to the tuner backend, this combined schedule is used as the initial schedule, which avoids inefficiently tuning $S_{i}$ from scratch.
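The text does not spell out how "tends to stabilize" is measured, so the following Python sketch shows only one plausible criterion under our own assumptions: tuning of a mini-subgraph is treated as stable when its best measured latency has improved by less than a relative tolerance over the last few rounds. The names `has_stabilized` and `ready_to_join`, and the parameters `window` and `tol`, are hypothetical; latencies are assumed to be positive.

```python
def has_stabilized(latencies, window=3, tol=0.01):
    """True if the best latency improved by less than `tol` (relative) in the last `window` rounds."""
    if len(latencies) <= window:
        return False  # not enough feedback yet
    best_before = min(latencies[:-window])
    best_now = min(latencies)
    return (best_before - best_now) / best_before < tol

def ready_to_join(history, window=3, tol=0.01):
    """Join S_i only when every mini-subgraph M_ij has stabilized."""
    return all(has_stabilized(h, window, tol) for h in history.values())

# Hypothetical per-round latencies (ms) reported by the tuner backend.
history = {
    "M_11": [10.0, 8.0, 8.0, 8.0, 8.0],   # plateaued
    "M_12": [10.0, 9.0, 8.0, 7.0, 6.0],   # still improving
}
print(ready_to_join(history))  # False
```

A stricter `tol` or a larger `window` keeps the mini-subgraphs in the dividing stage longer before they are joined.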

Our cycle-free partitioning function Cluster only limits the maximum weight for each subgraph without other constraints on structures. Further, with the Split and Join functions, AGO is able to optimize any subgraph efficiently.

VI Evaluation

We implemented the graph frontend in C++ based on TVM (0.8dev1) [3], and the reformer layer and the tuner in both Python and C++. In this section, we evaluate AGO on two mobile CPU platforms: the Kirin 990 SoC (Android v10), representing high-end devices, and the Qualcomm Snapdragon (Qsd) 810 SoC (Android v8), representing low-end devices with strict resource constraints. We compare AGO with Torch Mobile [18] and Ansor [9]. Torch Mobile is a widely used deep learning framework that employs XNNPACK [21], a hand-tuned high-performance library developed by Google. Ansor is the state-of-the-art auto-tuning framework based on TVM and performs better than TensorFlow Lite [19, 9]. Thus, TensorFlow Lite is not included in our benchmarks.

VI-A End-to-End Performance

In this subsection, we evaluate AGO in terms of end-to-end inference performance. Our benchmarks cover four classical neural networks: MobileNet-V2 (MBN) [11], MNasNet (MNSN) [12], SqueezeNet (SQN) [13], and ShuffleNet-V2 (SFN) [14], and two emerging networks: Bert-tiny (BT) [15, 16] and MobileViT (MVT) [17]. These networks are lightweight and widely used for mobile deep learning services. For both the Kirin 990 SoC and the Qsd 810 SoC, we set the batch size to 1 for all input tensors due to constrained computing power, which is also a general setting for mobile inference. Besides, we test different shapes of the input tensor for each classical network: small shape ($N$=1, $I$=3, $H$=56, $W$=56), middle shape ($N$=1, $I$=3, $H$=112, $W$=112), and large shape ($N$=1, $I$=3, $H$=224, $W$=224). These shapes represent various workloads due to divergent image resolutions in real applications. For the new language model BT, we set the input sequence length to 128, which is the longest sequence it supports. For the new model MVT, we only evaluate it on the large shape, which is the image size of the ImageNet dataset [22]. All networks are executed with float32 precision. Besides, we set the search budget for AGO and Ansor to 20,000, which is suggested by Ansor [9] for sufficient tuning.

We report the speedup of each method over Torch Mobile for classical networks in Fig. 10 and Fig. 11, where the number on top of each bar is the raw latency in milliseconds. On the Qsd 810 SoC, AGO achieves an average speedup of $1.5\times$, $1.6\times$, and $1.8\times$ over Torch Mobile on the three input tensor shapes, respectively. Compared with Ansor, AGO achieves an average speedup of $1.2\times$ on each input shape. The main reason behind the significant improvement over Torch Mobile is that hand-tuned libraries often devote tremendous engineering effort to optimizing typical workloads, while non-typical operators are less optimized. The speedup over Ansor mainly originates from the intensive fusion and joint optimization for complicated subgraphs, which are opportunities missed by Ansor. For example, when there are many subgraphs with consecutive pointwise and depthwise convolutions, AGO achieves an average speedup of $1.3\times$ over Ansor.

Similarly, on the Kirin 990 SoC, AGO achieves an average speedup of $1.9\times$, $2.1\times$, and $1.5\times$ over Torch Mobile on the three input tensor shapes, respectively. Compared with Ansor, AGO achieves an average speedup of $2.6\times$, $1.6\times$, and $1.1\times$, respectively. Again, these improvements stem directly from our intensive fusion and joint optimization. By contrast, Ansor suffers from a limited tuning space due to the simple subgraph structures generated by Relay [5]. For example, AGO significantly outperforms both baselines on MNSN, which involves massive pointwise and depthwise convolutions. Both Torch Mobile and Ansor can only perform conventional fusion, while AGO can achieve either intensive fusion or joint optimization.

We further employ AGO to optimize BT and MVT, which are two new networks. We report the results in Fig. 12, where we do not test MVT on the Qsd 810 SoC due to its limited resources. Compared with Torch Mobile, AGO improves the performance by $38.2\%$ on BT and $34.3\%$ on MVT. Further, compared with Ansor, AGO improves the performance by $20.5\%$ on BT and $29.1\%$ on MVT. In summary, AGO can readily be used to boost new neural architectures without any manual intervention.

Additionally, the tuning budget of 20,000 implies up to a day of compilation time. This is affordable to practitioners since they only need to execute AGO once before a long-running deployment. Moreover, it is much shorter than the weeks or even months required by hand-tuning.

VI-B Micro Benchmark

In this subsection, we further study where the performance gain of AGO comes from by evaluating our intensive fusion and reformer layer. Then, we evaluate our graph partitioning algorithm by inspecting the subgraphs generated by our algorithm and by Relay [5], respectively.

We first compare three variants of AGO to break down the improvements: 1) AGO-NI (no intensive fusion in the tuner backend); 2) AGO-NR (no reformer layer, i.e., tuning a large subgraph directly); 3) AGO, same as in Section VI-A, as the baseline. We then evaluate them on four subgraphs. Each subgraph consists of two complex operators and some other simple operators. Each complex operator is either a pointwise convolution or a depthwise convolution. Except for the subgraph with two depthwise convolutions, the subgraphs are extracted from MBN and MNSN. Additionally, the tuning budget is 2,000 for each variant and subgraph.

We present the results in Fig. 13, where the number after $B$ is the batch size. AGO-NI suffers an average $17\%$ performance loss compared with AGO on the two platforms. The reason is that AGO can fuse multiple complex operators to further improve the performance, while AGO-NI only optimizes them jointly with conventional fusion. Further, AGO-NR suffers a performance loss of nearly $27\%$. This is because directly optimizing a complicated subgraph is hard, while AGO addresses this issue through the divide-and-conquer tuning mechanism. We also observe some cases where AGO-NI outperforms AGO (Fig. 13(d)). This indicates that AGO fails to find better schedules because the search space grows after joining mini-subgraphs, so the budget is used inefficiently in such workloads. This issue can be addressed by prolonging the tuning of mini-subgraphs before joining.

Next, we evaluate our graph partitioning algorithm. We present the subgraph weight distribution for MVT in Fig. 14. We construct ten weight bins in log scale (e.g., bin $[1,2)$ means weight interval $[2^{1},2^{2})$), and report the number of subgraphs in each bin. The new model MVT integrates attention modules, yielding a large number of reshape and transpose operators. Relay heuristically takes such operators as delimiters and produces a total of 259 subgraphs, 105 of which are trivial and have a weight less than 20. Besides, the average weight, the median weight, and Jain's fairness index (which measures balance; higher is better) are 138, 23, and 0.19, respectively. In contrast, AGO generates 82 subgraphs. Most of them have a large weight, as shown in Fig. 14. For AGO, the average weight, the median weight, and the Jain index are 437, 350, and 0.55, respectively. Therefore, our partitioning algorithm can generate more complicated subgraphs while maintaining balance. Take as an example a typical structure in MVT that contains eight consecutive operators: matrix multiplication, reshape, add, reshape, transpose, reshape, matrix multiplication, and reshape. Relay produces five fragmented subgraphs for this structure, missing the opportunities of intensive fusion for the two matrix multiplications and joint optimization for all simple operators. Such a partition leads to inferior performance since the reshape/transpose operators involve expensive memory loads/stores. By contrast, AGO typically groups such operators together to boost the performance.
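For reference, the two statistics used above can be computed as follows. Jain's fairness index for weights $x_{1},\ldots,x_{n}$ is $(\sum_{i}x_{i})^{2}/(n\sum_{i}x_{i}^{2})$, which equals 1 for perfectly balanced weights and approaches $1/n$ when one weight dominates. The Python sketch below uses made-up weights, not the measured MVT data, and the function names are ours; weights are assumed to be positive.

```python
import math
from collections import Counter

def jain_index(weights):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in (0, 1]."""
    total = sum(weights)
    return total * total / (len(weights) * sum(x * x for x in weights))

def log2_bins(weights):
    """Count weights per log-scale bin; bin k covers the interval [2^k, 2^(k+1))."""
    return Counter(math.floor(math.log2(x)) for x in weights)

# Hypothetical subgraph weights for illustration only.
weights = [2, 3, 8, 300, 500]
print(log2_bins(weights))        # Counter({1: 2, 8: 2, 3: 1})
print(jain_index([4, 4, 4, 4]))  # 1.0
print(jain_index([1, 0, 0, 0]))  # 0.25
```

A partition with many trivial subgraphs next to a few heavy ones yields a low index, which is the imbalance reported for Relay above.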

VII Related Work

Deep learning compiler: Various deep compilers have been proposed to replace error-prone manual optimization [2, 3, 7, 23, 24, 8, 9, 25, 26, 27, 28, 29, 30, 10, 31, 32]. Tensor Comprehensions [23] and TVM [3] adopt the idea of decoupling optimization from operator description to simplify the auto-tuning process, which is then provided by AutoTVM [7], FlexTensor [8], Ansor [9], and ALT [32] via search algorithms. TASO [24], Tensat [30], and PET [31] perform graph substitutions to generate more efficient graphs. To speed up the tuning, AKG [10] applies the polyhedral model, while delicate cost models [26, 27] and heuristics [28, 33] are also proposed. Compared with AGO, 1) their tuners do not support intensive fusion; 2) their frontends take the tuner as a black box and no cross-layer mechanism is involved; 3) their graph partitioning algorithms impose unnecessary constraints on subgraph structures, and thus can only generate simple subgraphs.

Operator fusion: Many systems exploit operator fusion as an important optimization technique [25, 34, 6, 35, 10, 36]. In general, they can fuse a complex operator with its following simple operators, which is referred to as conventional fusion in this work. For instance, [36] can fuse memory-bound operators for NVIDIA GPU based on XLA [2]. [10, 6] exploit the polyhedral model to explore fusion opportunities. Although two matrix multiplications can also be fused on NVIDIA GPU in Bolt [35], this relies on the vendor library CUTLASS [37] and on the fact that the cache (shared memory) in NVIDIA GPU is programmable. Thus, Bolt can only fuse matrix multiplications that CUTLASS supports, and cannot fuse general complex operators on CPUs. Compared with these works, AGO enables generic auto-tuned intensive fusion on mobile devices, while guaranteeing efficiency via careful analysis of computation redundancy.

Computational graph partitioning: Classical graph partitioning has been extensively studied [38], typically in the distributed computing area. In deep learning systems [39, 40, 41, 42, 43], partitioning is often used to increase parallelism and improve performance. For instance, IOS [39] and Unity [43] exploit partitioning to parallelize subgraphs on NVIDIA GPU. [40] partitions the graph to reduce the peak runtime memory footprint. SPINN [41], CLIO [42], and Walle [44] partition a computational graph into two parts, with one part running on the device and the other running on the cloud/edge server. Compared with these systems, the major goal of our graph partitioning is to improve mobile inference performance via compilation techniques while provably keeping the partition acyclic.

VIII Conclusion

We propose AGO, a framework that removes the constraints on graph optimization to boost the inference performance of AI models on mobile devices. AGO provides a new partitioning scheme that generates subgraphs with arbitrary structures while keeping the partition acyclic. It also includes a tuner that applies intensive operator fusion and joint optimization to boost arbitrary subgraphs. Additionally, AGO devises a divide-and-conquer mechanism to improve tuning efficiency. Experiments show that AGO significantly outperforms state-of-the-art hand-tuned libraries and makes great progress over auto-tuning frameworks. Researchers can use AGO to further explore the performance characteristics of complicated subgraphs. Practitioners can easily exploit AGO to improve inference performance without manual intervention.

IX Acknowledgement

We thank the anonymous reviewers of INFOCOM 2023 for their valuable comments. This work is partially supported by the National Natural Science Foundation of China under Grant Number 62272213 and the Jiangsu Innovation and Entrepreneurship (Shuangchuang) Program.

References

[1] X. Jiang, H. Wang, Y. Chen, Z. Wu, L. Wang, B. Zou, Y. Yang, Z. Cui, Y. Cai, T. Yu, C. Lv, and Z. Wu, “MNN: A universal and efficient inference engine,” in Proceedings of MLSys, 2020.
[2] C. Leary and T. Wang, “Xla: Tensorflow, compiled,” TensorFlow Dev Summit, 2017.
[3] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze et al., “TVM: An automated end-to-end optimizing compiler for deep learning,” in Proceedings of USENIX OSDI, 2018, pp. 578–594.
[4] R. Baghdadi, J. Ray, M. B. Romdhane, E. Del Sozzo, A. Akkas, Y. Zhang, P. Suriana, S. Kamil, and S. Amarasinghe, “Tiramisu: A polyhedral compiler for expressing fast and portable code,” in Proceedings of IEEE/ACM CGO, 2019, pp. 193–205.
[5] J. Roesch, S. Lyubomirsky, L. Weber, J. Pollock, M. Kirisame, T. Chen, and Z. Tatlock, “Relay: A new IR for machine learning frameworks,” in Proceedings of ACM MAPL, 2018, pp. 58–68.
[6] J. Zhao, X. Gao, R. Xia, Z. Zhang, D. Chen, L. Chen, R. Zhang, Z. Geng, B. Cheng, and X. Jin, “Apollo: Automatic partition-based operator fusion through layer by layer optimization,” in Proceedings of MLSys, 2022.
[7] T. Chen, L. Zheng, E. Yan, Z. Jiang, T. Moreau, L. Ceze, C. Guestrin, and A. Krishnamurthy, “Learning to optimize tensor programs,” in Proceedings of NeurIPS, 2018, pp. 3389–3400.
[8] S. Zheng, Y. Liang, S. Wang, R. Chen, and K. Sheng, “Flextensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system,” in Proceedings of ACM ASPLOS, 2020, pp. 859–873.
[9] L. Zheng, C. Jia, M. Sun, Z. Wu, C. H. Yu, A. Haj-Ali, Y. Wang, J. Yang, D. Zhuo, K. Sen et al., “Ansor: generating high-performance tensor programs for deep learning,” in Proceedings of USENIX OSDI, 2020, pp. 863–879.
[10] Z. Jie, B. Li, N. Wang, G. Zhen, Z. Renwei, G. Xiong, C. Bin, W. Chen, C. Yun, L. Zheng, D. Peng, Z. Kun, and J. Xuefeng, “Akg: automatic kernel generation for neural processing units using polyhedral transformations,” in Proceedings of ACM PLDI, Jun. 2021, pp. 1233–1248.
[11] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” 2017.
[12] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, “MnasNet: Platform-aware neural architecture search for mobile,” in Proceedings of IEEE/CVF CVPR, June 2019.
[13] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5mb model size,” 2016.
[14] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “Shufflenet v2: Practical guidelines for efficient cnn architecture design,” in Proceedings of ECCV, 2018, pp. 116–131.
[15] I. Turc, M. Chang, K. Lee, and K. Toutanova, “Well-read students learn better: The impact of student initialization on knowledge distillation,” CoRR, vol. abs/1908.08962, 2019.
[16] P. Bhargava, A. Drozd, and A. Rogers, “Generalization in nli: Ways (not) to go beyond simple heuristics,” 2021.
[17] S. Mehta and M. Rastegari, “Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer,” in Proceedings of ICLR, 2022.
[18] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” in Proceedings of NeurIPS, 2019, pp. 8026–8037.
[19] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning,” in Proceedings of USENIX OSDI, 2016, pp. 265–283.
[20] M. Wang, S. Ding, T. Cao, Y. Liu, and F. Xu, “Asymo: scalable and efficient deep-learning inference on asymmetric mobile cpus,” in Proceedings of ACM MobiCom, 2021, pp. 215–228.
[21] Google, “Xnnpack: Highly optimized library of floating-point neural network inference operators for arm, webassembly, and x86 platforms,” 2021. [Online]. Available: https://github.com/google/XNNPACK
[22] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proceedings of CVPR. IEEE, 2009, pp. 248–255.
[23] N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, “Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions,” arXiv preprint arXiv:1802.04730, 2018.
[24] Z. Jia, O. Padon, J. Thomas, T. Warszawski, M. Zaharia, and A. Aiken, “Taso: optimizing deep learning computation with automatic generation of graph substitutions,” in Proceedings of SOSP, 2019, pp. 47–62.
[25] Z. Zheng, P. Zhao, G. Long, F. Zhu, K. Zhu, W. Zhao, L. Diao, J. Yang, and W. Lin, “Fusionstitching: boosting memory intensive computations for deep learning workloads,” arXiv preprint arXiv:2009.10924, 2020.
[26] Y. Wang, X. Zhou, Y. Wang, R. Li, Y. Wu, and V. Sharma, “Tuna: A static analysis approach to optimizing deep neural networks,” arXiv preprint arXiv:2104.14641, 2021.
[27] R. Li, Y. Xu, A. Sukumaran-Rajam, A. Rountev, and P. Sadayappan, “Analytical characterization and design space exploration for optimization of cnns,” in Proceedings of ACM ASPLOS, 2021, pp. 928–942.
[28] B. Steiner, C. Cummins, H. He, and H. Leather, “Value learning for throughput optimization of deep learning workloads,” Proceedings of MLSys, vol. 3, 2021.
[29] R. Baghdadi, M. Merouani, M.-H. Leghettas, K. Abdous, T. Arbaoui, K. Benatchba et al., “A deep learning based cost model for automatic code optimization,” Proceedings of MLSys, vol. 3, 2021.
[30] Y. Yang, P. Phothilimthana, Y. Wang, M. Willsey, S. Roy, and J. Pienaar, “Equality saturation for tensor graph superoptimization,” Proceedings of MLSys, vol. 3, 2021.
[31] H. Wang, J. Zhai, M. Gao, Z. Ma, S. Tang, L. Zheng, Y. Li, K. Rong, Y. Chen, and Z. Jia, “PET: Optimizing tensor programs with partially equivalent transformations and automated corrections,” in Proceedings of USENIX OSDI. USENIX Association, Jul. 2021, pp. 37–54.
[32] Z. Xu, J. Xu, H. Peng, W. Wang, X. Wang, H. Wan, H. Dai, Y. Xu, H. Cheng, K. Wang et al., “Alt: Breaking the wall between graph and operator level optimizations for deep learning compilation,” arXiv preprint arXiv:2210.12415, 2022.
[33] H. Zhu, R. Wu, Y. Diao, S. Ke, H. Li, C. Zhang, J. Xue, L. Ma, Y. Xia, W. Cui et al., “ROLLER: Fast and efficient tensor compilation for deep learning,” in Proceedings of USENIX OSDI, 2022, pp. 233–248.
[34] W. Niu, J. Guan, Y. Wang, G. Agrawal, and B. Ren, “Dnnfusion: Accelerating deep neural networks execution with advanced operator fusion,” in Proceedings of ACM PLDI, 2021, pp. 883–898.
[35] J. Xing, L. Wang, S. Zhang, J. Chen, and Y. Zhu, “Bolt: Bridging the gap between auto-tuners and hardware-native performance,” arXiv e-prints, 2021.
[36] Z. Zheng, X. Yang, P. Zhao, G. Long, K. Zhu, F. Zhu, W. Zhao, X. Liu, J. Yang, J. Zhai et al., “Astitch: enabling a new multi-dimensional optimization space for memory-intensive ml training and inference on modern simt architectures,” in Proceedings of ACM ASPLOS, 2022, pp. 359–373.
[37] Nvidia, “Nvidia/cutlass: Cuda templates for linear algebra subroutines.” [Online]. Available: https://github.com/NVIDIA/cutlass
[38] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of ACM SIGMOD, 2010, pp. 135–146.
[39] Y. Ding, L. Zhu, Z. Jia, G. Pekhimenko, and S. Han, “Ios: Inter-operator scheduler for cnn acceleration,” 2021.
[40] B. H. Ahn, J. Lee, J. M. Lin, H.-P. Cheng, J. Hou, and H. Esmaeilzadeh, “Ordering chaos: Memory-aware scheduling of irregularly wired neural networks for edge devices,” 2020.
[41] S. Laskaridis, S. I. Venieris, M. Almeida, I. Leontiadis, and N. D. Lane, “Spinn: synergistic progressive inference of neural networks over device and cloud,” in Proceedings of ACM MobiCom, 2020, pp. 1–15.
[42] J. Huang, C. Samplawski, D. Ganesan, B. Marlin, and H. Kwon, “Clio: Enabling automatic compilation of deep learning pipelines across iot and cloud,” in Proceedings of ACM MobiCom, 2020, pp. 1–12.
[43] C. Unger, Z. Jia, W. Wu, S. Lin, M. Baines, C. E. Q. Narvaez, V. Ramakrishnaiah, N. Prajapati, P. McCormick, J. Mohd-Yusof, X. Luo, D. Mudigere, J. Park, M. Smelyanskiy, and A. Aiken, “Unity: Accelerating DNN training through joint optimization of algebraic transformations and parallelization,” in Proceedings of USENIX OSDI. Carlsbad, CA: USENIX Association, Jul. 2022, pp. 267–284.
[44] C. Lv, C. Niu, R. Gu, X. Jiang, Z. Wang, B. Liu, Z. Wu, Q. Yao, C. Huang, P. Huang, T. Huang, H. Shu, J. Song, B. Zou, P. Lan, G. Xu, F. Wu, S. Tang, F. Wu, and G. Chen, “Walle: An End-to-End, General-Purpose, and Large-Scale production system for Device-Cloud collaborative machine learning,” in Proceedings of USENIX OSDI. Carlsbad, CA: USENIX Association, Jul. 2022, pp. 249–265.