This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

AGO: Boosting Mobile AI Inference Performance by Removing Constraints on Graph Optimization

Zhiying Xu1, Hongding Peng1, Wei Wang2 1Nanjing University, China {zyxu, mg20330044}@smail.nju.edu.cn 2Nanjing University, China [email protected]
Abstract

Traditional deep learning compilers rely on heuristics for subgraph generation, which impose extra constraints on graph optimization, e.g., each subgraph can only contain at most one complex operator. In this paper, we propose AGO, a framework for graph optimization with arbitrary structures to boost the inference performance of deep models by removing such constraints. To create new optimization opportunities for complicated subgraphs, we propose intensive operator fusion, which can effectively stitch multiple complex operators together for better performance. Further, we design a graph partitioning scheme that allows an arbitrary structure for each subgraph while guaranteeing the acyclic property among all generated subgraphs. Additionally, to enable efficient performance tuning on complicated subgraphs, we devise a novel divide-and-conquer tuning mechanism to orchestrate different system components. Through extensive experiments on various neural networks and mobile devices, we show that our system can improve the inference performance by up to 3.3×3.3\times when compared with state-of-the-art deep compilers.

Equal contributions.

I Introduction

Deep learning has become an essential building block for various mobile applications, such as machine translation and recommendation systems. The user experience of these mobile applications are critically impacted by the efficiency of running deep learning inference tasks on mobile devices. Therefore, code optimization of tensor operators, such as convolution and matrix multiplication, becomes an important research issue for mobile systems. While manual optimization can lead to order-of-magnitude reduction in inference delay [1], it typically incurs tremendous human efforts, as the tuning process highly depends on the specific hardware architecture as well as the neural network structure. To relieve developers from the burden of hand-tuning, researchers design deep learning compilers, such as XLA [2], TVM [3], and Tiramisu [4], to perform automatic code optimization with compilation and auto-tuning techniques. In these compilers, each operator is represented as a node in a computational graph, and the tuning result for an operator is called a schedule.

The system design of deep compilers can be divided into two major layers. On the top layer, a graph frontend partitions the computational graph into multiple subgraphs. In contrast with the huge optimization space of the whole graph, optimizing each subgraph separately is more manageable. To achieve this, existing graph frontends, such as Relay [5] and Apollo [6], group adjacent operators into the same subgraph according to specific heuristics. As a result, a subgraph can contain at most one complex operator (e.g., convolution and matrix multiplication) and other simple operators (e.g., padding, add, and ReLU). The bottom layer is tuner backend, where enormous schedules are explored in the tuning space of each subgraph separately. In general, the best schedule is composed of the optimal parameters for various code optimization techniques, e.g., selected data layouts, tile sizes for loop tiling, and operator fusion schemes. Existing tuners either exploit search-based methods [7, 8, 9] or polyhedral models [10, 6] to achieve such exploration. After exploration, the compiler will generate an optimized tensor program based on the best schedule.

Unfortunately, existing works cannot handle arbitrary graphs efficiently, especially when facing emerging new neural models that tend to use complex structures [11, 12, 13, 14, 15, 16, 17]. First, the graph frontend heavily relies on offline hard-coded heuristics to perform graph partitioning. These heuristics only produce simple subgraph structures while overlooking other optimization opportunities. They also do not have enough scalability to cater to new neural networks. Second, the tuner implicitly limits itself to a small tuning space, because the space is bounded by the simple subgraph generated by the frontend. Such strict constraints on graph optimization seriously compromises the inference performance. For example, existing frontends cannot generate subgraphs with multiple complex operators to enable intensive fusion and joint optimization. By contrast, we observe that removing this constraint can achieve up to 3.3×3.3\times speedup for end-to-end inference.

This paper proposes AGO, a framework that enables arbitrary structure graph optimization to boost the inference performance of mobile deep learning. In the top graph frontend, AGO exploits a new weighted clustering algorithm to perform graph partitioning, each of the generated subgraphs is free of prior constraints and may contain multiple complex operators. In the bottom backend, AGO designs a more powerful tuner, which can automatically explore schedules for any subgraph. Additionally, different from prior arts, AGO incorporates an extra middle reformer layer to orchestrate the frontend and the backend for efficient subgraph optimization.

We need to address three unique challenges to achieve the arbitrary structure graph optimization.

Challenge 1: How to remove the constraints on subgraph structures while keeping the network acyclic? Allowing arbitrary subgraph structures means that any edge in the original graph can cross a cut in the partition. However, this can lead to cycles among the generated subgraphs. Cyclic dependencies will result in deadlocks when executing these subgraphs at the runtime. To address this issue, we analytically scrutinize the inter-subgraph data dependency given a directed computational graph. We then show that we can safely group operators in the affix set without generating cycles. Based on our analysis, we devise an iterative clustering algorithm achieving the acyclic property in the graph frontend.

Challenge 2: How to efficiently tune arbitrary subgraphs? The primary hurdle to optimize a subgraph with any structure lies in the case that multiple complex operators reside in the same subgraph, which we call complicated subgraphs. Although a complicated subgraph can expand the search space to open more optimization opportunities, directly exploring schedules in such a large space will incur formidable computational costs, hence inefficient tuning. We use two schemes to improve the tuning efficiency. First, when generating subgraphs in the graph frontend, we assign a weight for each operator via systematically modeling the relationship between the subgraph structure and the tuning complexity. Therefore, we can easily avoid unreasonably huge subgraphs by suppressing the weight. Second, during tuning, we propose a divide-and-conquer mechanism in the reformer layer to handle a complicated subgraph. The reformer layer further splits a subgraph into several mini-subgraphs, each of which is small to be tuned efficiently. After several rounds of tuning mini-subgraphs, the reformer layer will join them back as a large subgraph for further optimization.

Challenge 3: How to create new optimization opportunities given a complicated subgraph? One of the most influential tuning techniques to optimize a subgraph is to fuse operators together so that expensive memory accesses can be reduced. However, different from the conventional operator fusion, fusing multiple complex operators in a complicated subgraph can induce redundant computation, which poses an enormous challenge for the bottom tuner. To avoid the dilemma between the redundancy of fusion and the insufficient optimization without fusion, we systematically analyze the inter-operator data dependency in complicated subgraphs. We then discover two categories of subgraph structures that can enable operator fusion while obviating re-computation, which we call intensive fusion. Therefore, we can exploit new optimization techniques when a complicated subgraph falls into one of the two categories. Additionally, when this condition is unmet, our tuner can still benefit from joint optimization for all operators in a complicated subgraph, while the tuning efficiency is already emphasized by the reformer layer.

In summary, we make the following contributions:

\bullet We reveal that the strict constraints imposed on graph optimization is a major obstacle for deep learning compilers to catering to emerging complicated neural architectures.

\bullet We design a weighted clustering algorithm to perform graph partitioning in the frontend, which removes the constraints on subgraph structures while guaranteeing the acyclic property in the resulting partition.

\bullet We devise a divide-and-conquer tuning mechanism to efficiently tune complicated subgraphs, which serves as a middle layer to orchestrate the frontend and the backend.

\bullet We craft a powerful tuner in the backend, which automatically and effectively optimizes complicated subgraphs through the proposed intensive fusion technique.

We integrate AGO into an existing deep compiler and conduct extensive experiments for evaluation. The results show that AGO improves the inference performance by up to 3.3×3.3\times compared with state-of-the-art hand-tuned libraries and auto-tuning frameworks, e.g., Torch Mobile [18] and Ansor [9].

II System Overview

Refer to caption
Figure 1: An illustrative computational graph.
Refer to caption
Figure 2: System overview of AGO.

The input for a deep compiler is a deep learning model, which can be represented as a computational graph where operators and tensors are denoted as nodes and edges respectively, as illustrated in Fig. 1. Complex operators, such as convolution and matrix multiplication operators, are represented as green nodes in Fig. 1, while the orange nodes represent simple operators, e.g., add, ReLU, and normalization. In prior deep compilers [3, 5, 9, 6], the computational graph is first partitioned by heuristics into many small subgraphs. Each subgraph in these frameworks can contain at most one complex operator. Thus, op1op_{1} and op2op_{2} must reside in two different subgraphs, although they share the same input tensor and can be stitched together to improve data locality. Operators op3op_{3} and op4op_{4} may constitute another subgraph, even if such no-complex subgraph is trivial, hence no room for performance tuning. The other branch in Fig. 1 contains two complex operators, op5op_{5} and op7op_{7}, which are forced to be partitioned into two subgraphs although combining all the three operators may benefit from intensive operator fusion. Therefore, existing heuristics generate unbalanced, small, and simple subgraph structures and hinder further optimization opportunities.

We observe that such inefficient partitioning originates from two aspects. First, the heuristics employed in the graph frontend introduce many unnecessary constraints on the subgraph structure. Second, the underlying tuner backend cannot handle complicated subgraph structures due to the over-simplification of the tuning space. Both of them contribute significantly.

To address this issue, AGO exploits a weighted clustering algorithm to perform graph partitioning which allows arbitrary subgraph structures. Moreover, AGO designs a more powerful tuner achieving intensive fusion to handle any complicated subgraphs. For instance, op1op_{1}, op2op_{2}, op3op_{3}, and op4op_{4} can be grouped together for operator fusion and joint optimization. Also, we can place op5op_{5}, op6op_{6}, and op7op_{7} in the same subgraph in the frontend, and then intensively fuse them in the backend to further boost the performance.

The workflow of AGO is illustrated in Fig. 2.

  1. 1.

    Given a model file generated by common deep learning frameworks (e.g., TensorFlow [19]), the graph frontend first resolves it into a computational graph GG.

  2. 2.

    The frontend partitions operators in GG into nn subgraphs, each of which is denoted as SiS_{i}, where 1in1\leq i\leq n.

  3. 3.

    In the reformer layer, AGO further splits each SiS_{i} into mim_{i} mini-subgraphs, each of which is denoted as MijM_{ij}, where 1jmi1\leq j\leq m_{i}.

  4. 4.

    We then offload each MijM_{ij} as a tuning task for the tuner backend.

  5. 5.

    After preliminary mini-subgraph optimization, the backend provides the tuned schedules as feedbacks to the reformer layer.

  6. 6.

    Depending on the feedback, the reformer layer joins the selected mini-subgraphs MijM_{ij} back as a large subgraph.

  7. 7.

    Each of the merged subgraph SiS_{i} becomes a new tuning task for the tuner backend.

  8. 8.

    Finally, after optimizing each subgraph SiS_{i}, we will generate more efficient codes based on the tuned schedules.

Next, we will elaborate our system in a bottom-up manner.

III Subgraph Optimization in Backend

In deep models, a tensor operator can be implemented as deeply-nested loops in the source code. Thus most optimization techniques can be achieved by loop transformation. For instance, we can split and reorder loops to achieve loop tiling, or merge loops of two operators for operator fusion. In practice, tiling and fusion are nearly the most two influential techniques for performance optimization. Specially, operator fusion works on multiple operators, hence is dependent on subgraph structures. To optimize arbitrary subgraphs, our major solution in the tuner backend is to craft a new operator fusion scheme, named as intensive operator fusion.

In this section, we first introduce how conventional operator fusion works in a mini-subgraph MijM_{ij}. Then, we will depict the intensive fusion and how to exploit it for a subgraph SiS_{i}.

III-A Conventional Operator Fusion for Mini-subgraph

Split from a subgraph by the later reformer layer, each mini-subgraph MijM_{ij} contains at most one complex operator. Suppose MijM_{ij} contains three operators: 2-d convolution, bias addition, and ReLU, which is a typical workload for recent convolutional neural networks. In the following, we will use this mini-subgraph as an example to illustrate how operator fusion improves its computational efficiency.

for n in range(N):
for o in range(O):
for h in range(H):
for w in range(W):
Conv[n,c,h,w] = 0.0
for ri, rr, rc in range(I, R, C):
Conv[n,o,h,w] += \ Inp[n,ri,h+rr,w+rc] * Weight[o,ri,rr,rc]
for n, o, h, w in range(N, O, H, W):
Sum[…] = Conv[…] + Bias[o]
for n, o, h, w in range(N, O, H, W):
ReLU[…] = max(Sum[…], 0.0)
Figure 3: Loop nest of a mini-subgraph without fusion.

We present the initial loop nest of MijM_{ij} in Fig. 3 111Direct convolution is preferable than matrix multiplication based implementation [20]. The latter is often used when lacking direct library supports. , where N,O,H,WN,O,H,W represent the batch size, the number of output channels, the height, and the width of the output tensor, respectively. II is the number of input channels, and R,CR,C are the height and the width of the convolutional window. Besides, the three reduction loops I,R,CI,R,C are written in one line for simplicity. In this program, the bias addition is executed after the whole convolution. When the Conv tensor is large, most of its elements will have been spilled out of the cache at bias addition. Subsequently, we must fetch these elements from the main memory into cache again when performing the addition. Such many additional cache misses lead to poor performance, especially on mobile devices with small caches.

for n, o, h, w in range(N, O, H, W):
Conv[n, o, h, w] = 0.0
for ri, rr, rc in range(I, R, C):
Conv[…] += Inp[…] * Weight[…]
Sum[…] = Conv[…] + Bias[o]
ReLU[…] = max(Sum[…], 0.0)
Figure 4: Conventional fusion within a mini-subgraph.

We can perform operator fusion within MijM_{ij} to strike this issue, as illustrated in Fig. 4. Once each element of the Conv tensor is calculated, the following addition and ReLU operations will be performed. Thus, data elements are immediately consumed by downstream operators while still in cache, hence improved operational intensity and inter-operator data locality.

III-B Intensive Operator Fusion for Subgraph

Conventional operator fusion only stitches a complex operator with its following simple operators, hence is also named as epilogue fusion. To optimize subgraphs with complicated structures, we propose intensive fusion, which allows fusing multiple complex operators together. Intensive fusion can further improve the inter-complex-operator data locality without the strict constraints on the subgraph structure, i.e., only one complex operator is allowed in prior schemes.

Unfortunately, the profits from intensive fusion are not free. Since a complex operator involves reduction, the fusion-after-tiling optimization path often induces redundant computation.

In this subsection, we first systematically analyze the inter-operator data dependency to characterize redundant computation. Based on the analysis, we will depict how to achieve intensive fusion efficiently by identifying two categories of subgraph structures without redundancy.

III-B1 Why Re-computation Happens

Suppose we have two 2-d convolution operators in a subgraph SiS_{i}, and now we try to fuse them.

for n, o2, h, ow in range(N, O2, H2, W2 // 16):
# intra-tile loops, compute a tile of Conv1
for o1, ih1, iw1 in range(O1, 1 + R2 - 1, 16 + C2 - 1):
Conv1[n, o1, h+ih1, ow*16+iw1] = 0.0
for ri, rr1, rc1 in range(I, R1, C1): # reduction
Conv1[n, o1, h+ih1, ow*16+iw1] +=
# intra-tile loops, compute a tile of Conv2
for io2, ih2, iw2 in range(1, 1, 16):
Conv2[…, ow*16+iw2] = 0.0
for ro, rr2, rc2 in range(O1, R2, C2): # reduction
Conv2[…] += \ Conv1[…, ow*16+iw2+rc2] * Weight2[…]
Figure 5: Intensive fusion program of two convolutions.

In Fig. 5, we use (N,O1,H1,W1)(N,O_{1},H_{1},W_{1}) and (N,O2,H2,W2)(N,O_{2},H_{2},W_{2}) to represent the shapes of the upstream Conv1 tensor and the downstream Conv2 tensor, respectively. In this case, we also have H1=H2+(R21)H_{1}=H_{2}+(R_{2}-1) and W1=W2+(C21)W_{1}=W_{2}+(C_{2}-1) according to the convolution algorithm. For simplicity, assume the tiling of Conv2 is 1×1×161\times 1\times 16 on O2×H2×W2O_{2}\times H_{2}\times W_{2} dimensions (W2>16W_{2}>16). A tile is a fraction of some tensor. Namely, the intra-tile loops {io2, ih2, iw2} only compute a vector of length 1616 of Conv2. According to the convolution algorithm, this vector-like tile requires a O1×(1+(R21))×(16+(C21))O_{1}\times\left(1+(R_{2}-1)\right)\times\left(16+(C_{2}-1)\right) tile of Conv1 for computation, which is provided by the {o1, ih1, iw1} loops. However, the reduction loops of the upstream convolution (i.e., the ri, rr1, rc1 loops) will be executed N×O2×H2×W216×O1×R2×(15+C2)N\times O_{2}\times H_{2}\times\dfrac{W_{2}}{16}\times O_{1}\times R_{2}\times(15+C_{2}) times in total, which is much larger than the non-fusion case N×O1×(H2+R21)×(W2+C21)N\times O_{1}\times(H_{2}+R_{2}-1)\times(W_{2}+C_{2}-1).

We here formally depict the redundancy issue. We denote the global iteration space spanned by the loops of the upstream operator and the downstream operator as GS1GS_{1} and GS2GS_{2}. Then the amount of computation can be calculated as |GS1||GS_{1}| and |GS2||GS_{2}| respectively. After loop tiling, we denote the iteration space spanned by the inner intra-tile loops as TS1TS_{1} and TS2TS_{2}. Then the amount of computation is |GS1TS1||TS1|\left|\frac{GS_{1}}{TS_{1}}\right|\cdot|TS_{1}| and |GS2TS2||TS2|\left|\frac{GS_{2}}{TS_{2}}\right|\cdot|TS_{2}| respectively. Here we use ()()\frac{(\cdot)}{(\cdot)} as the inverse operator of Cartesian product ×\times. Next, after loop fusion, the intra-tile loops for TS1TS_{1} of the upstream operator will be attached to the loop structure GS2TS2\frac{GS_{2}}{TS_{2}} of the downstream operator. Thus the iteration space size of the upstream operator can be derived as |GS2TS2×(GS1TS1GS2TS2)||TS1|\left|\frac{GS_{2}}{TS_{2}}\times\left(\frac{GS_{1}}{TS_{1}}-\frac{GS_{2}}{TS_{2}}\right)\right|\cdot|TS_{1}|. This formula is larger than |GS1||GS_{1}| (i.e., redundancy) in two cases: 1) GS2TS2GS1TS1\frac{GS_{2}}{TS_{2}}-\frac{GS_{1}}{TS_{1}}\neq\emptyset (i.e., GS2TS2\frac{GS_{2}}{TS_{2}} contains a loop that is not needed by GS1TS1\frac{GS_{1}}{TS_{1}}); 2) |TS2|<|TS1||TS_{2}|<|TS_{1}|, which is only determined by the data mapping function of the downstream operator.

For Fig. 5, the outermost iteration space GS2TS2\frac{GS_{2}}{TS_{2}} is spanned by loops {n, o2, h, ow}. The middle part GS1TS1GS2TS2\frac{GS_{1}}{TS_{1}}-\frac{GS_{2}}{TS_{2}} is an empty set, where GS1TS1\frac{GS_{1}}{TS_{1}} consists of {n, h, ow}. And, the innermost TS1TS_{1} involves loops {o1, ih1, iw1}. The first source of redundancy is the loop o2 (GS2TS2GS1TS1)\in\left(\frac{GS_{2}}{TS_{2}}-\frac{GS_{1}}{TS_{1}}\right), thus Conv1 is recomputed for O2O_{2} times since a tile of Conv1 is reused by O2O_{2} channels/tiles of Conv2. Moreover, Conv1 is recomputed for H2W2×R2×(15+C2)H1W1×1×16\frac{H_{2}W_{2}\times R_{2}\times(15+C_{2})}{H_{1}W_{1}\times 1\times 16} times on the H2×W2H_{2}\times W_{2} dimensions because sliding-window operations in convolution have overlaps on H1H_{1} and W1W_{1} of Conv1 such that |TS2|<|TS1||TS_{2}|<|TS_{1}|. And an overlapping region is reused by multiple tiles of Conv2.

Refer to caption
Figure 6: Redundant computation after tiling and fusion.

We further illustrate the above issue in Fig. 6, where the yellow circle represents the output element ee of the upstream operator, while the blue circles denote the output elements {d1,d2,d3}\{d_{1},d_{2},d_{3}\} of the downstream operator. Suppose the three blue elements reside in three separate data tiles after loop tiling. Then, with fusion, the yellow circle will be duplicated and stitched into three blue tiles, which leads to re-computation for three times. For the example in Fig. 5, {d1,d2,d3}\{d_{1},d_{2},d_{3}\} can represent the O2O_{2} output channels.

When {d1,d2,,dk}\{d_{1},d_{2},...,d_{k}\} are distributed in two or more data tiles, ee may be re-computed for each tile, thus leading to computation redundancy.

III-B2 Removing the Re-computation

As above, the re-computation will only occur under two conditions: 1) data reuse for the output tensor of the upstream operator; 2) two or more output data tiles in the downstream operator. The first condition cannot be removed, since it is determined only by the operator definition (e.g., convolution algorithm). By contrast, we can break the second condition by computing the downstream complex operator without loop tiling on the reused dimensions. Consequently, as the rightmost side in Fig. 6 illustrates, three blue elements reside in the same tile, and the yellow circle has only one out-going edge.

For common convolutions, the input tensor (i.e., the output tensor of the upstream operator) will be reused in three dimensions O2,H2,W2O_{2},H_{2},W_{2} as that in Fig. 5. Without tiling, the original output data size of O2×H2×W2O_{2}\times H_{2}\times W_{2} will normally be larger than the cache capacity, hence poor cache utilization during execution. Fortunately, there are widely used convolution operators on mobile devices without this concern, namely, depthwise convolution and pointwise convolution. The former does not perform reduction on the input channel dimension while the latter is free of reduction in kernels (i.e., R2=C2=1R_{2}=C_{2}=1). Specifically, the input tensor will be reused only on H2,W2H_{2},W_{2} dimensions in depthwise convolution, and only O2O_{2} dimension in pointwise convolution.

Refer to caption
(a) Downstream depthwise conv.
Refer to caption
(b) Downstream pointwise conv.
Figure 7: Inter-operator data dependency in intensive fusion.

We achieve intensive fusion for these two categories in Fig. 7. We use uppercase letters to denote the original dimensions, while the corresponding lowercase letters to denote the tiled dimensions. When the downstream convolution is depthwise, as shown in Fig. 7(a), its data tile should be H2W2o2H_{2}W_{2}o_{2}, where H2,W2H_{2},W_{2} dimensions are not tiled due to reuse and o2o_{2} is a tiled dimension from O2O_{2}. Then the data tile of the upstream convolution should be H1W1o1H_{1}W_{1}o_{1}. In this case, we will also have o1=o2o_{1}=o_{2} since the number of input channels is the same as the number of output channels in the downstream depthwise convolution. Similarly, the tile of the downstream pointwise convolution can be denoted as h2w2O2h_{2}w_{2}O_{2} in Fig. 7(b). Both of them is much smaller than O2×H2×W2O_{2}\times H_{2}\times W_{2}. Notably in Fig. 7, the inter-operator data mapping is determined by the convolution algorithm. For example, in Fig. 7(b), the Conv2 tile h2w2O2h_{2}w_{2}O_{2} requires a Conv1 tile h1w1o1h_{1}w_{1}o_{1} and a Weight2 tile o1R2C2O2o_{1}R_{2}C_{2}O_{2} for computation, where h1=h2h_{1}=h_{2}, w1=w2w_{1}=w_{2}, and R2=C2=1R_{2}=C_{2}=1. Additionally, no constraints are imposed on the inner-level tiling of the two convolutions, indicated as the red dashes in Fig. 7. In other words, if intensive fusion requires that two convolution tiles reside in L1 cache for locality, then the tiling of themselves on registers will not be affected.

Putting all the analysis together, our intensive fusion creates new optimization opportunities for complicated subgraphs, when the downstream convolution is depthwise or pointwise. We do not include the analysis of matrix multiplication because it is mathematically equivalent to pointwise convolution. Even if the downstream convolution type is unmet for intensive fusion, our tuner can still benefit from a larger tuning space given a complicated subgraph. Meanwhile, the tuning efficiency for such unmet complicated subgraphs will be addressed by the later reformer layer. Therefore, our tuner eschews the need of prior constraints on subgraph structures in favor of intensive fusion and joint optimization.

IV Graph Partitioning in Frontend

Given the original directed computational graph GG, the graph frontend will partition GG into many smaller subgraphs, each of which can be optimized separately and efficiently. Here, we refer to the graph partition as a collective term. It is formally defined as a set of subgraphs S1,S2,,SnS_{1},S_{2},...,S_{n}, such that the nodes in each subgraph are disjointed and each node in GG belongs to exactly one subgraph. Based on our powerful tuner, we allow an arbitrary structure for each SiS_{i}. This further indicates that any edge in GG can potentially cross the cut in the partition. However, this can lead to 1) search space explosion and 2) cycles in the resulting graph partition. Larger search space will increase the optimization complexity, thus decrease the tuning efficiency for the tuner backend. To strike this issue, our solution involves two aspects. First, in the graph frontend, our partitioning algorithm will assign a weight for each operator and control the weight of each subgraph, hence obviating unreasonably huge subgraphs. Second, in the later reformer layer, we design a divide-and-conquer tuning mechanism to reduce the tuning complexity for each subgraph. Another issue is the cyclic dependency among subgraphs, which may disable some critical optimization techniques, e.g., data layout selection, and lead to deadlocks during runtime execution. To generate a cycle-free partition, we devise an iterative clustering algorithm, which will theoretically guarantee the acyclic property of the graph partitioning.

IV-A Weight Assignment for Operators

Refer to caption
Figure 8: Budget costs of optimizing different subgraphs.

In a computational graph GG, we denote the set of nodes as VV and the set of directed edges as EE. Each node in the graph, denoted as vv, represents an operator in the model, and each of the directed edges, denoted as ee, represents a tensor produced by the source operator and consumed by the destination operator.

To avoid an unreasonably huge subgraph SiS_{i}, we resort to measuring the tuning complexity of SiS_{i} directly during partitioning. Previous works use indirect metrics as weights [5], e.g., the number of operators in a subgraph SiS_{i}, which we find ineffective. Specifically, we observe that the contributions of operators in SiS_{i} towards the tuning complexity are not the same and highly depend on their tensor shapes and operator types. We here conduct an experiment to study the relationship between the subgraph structure and the tuning complexity. We use tuning budget, which is the total number of explored schedules to obtain stable performance for a subgraph, as an indicator of the tuning complexity. We then tune different subgraphs, each of which contains different operators, and record their tuning budgets.

We report the results in Fig. 8, where the budget is on a scale of 100, the batch size is 1, the padding size in convolution is 1, the height/width of the convolutional window is 3, and the numbers behind IOHWIOHW are the sizes of other corresponding dimensions. For example, in the second subgraph (Conv+AddConv+Add), the input tensor has shape (NN=1, II=32, HH=28, WW=28), the output tensor of ConvConv operator has shape (NN=1, OO=64, HH=28, WW=28). Based on Fig. 8, we have two observations. First, for each subgraph structure, the tuning budget does not scale directly with the number of operators, but illustrates a linear trend with tensor shapes. Second, for a given tensor shape, the tuning budget scales almost linearly with the number of operators, although the tuning space size increases exponentially with the number of operators.

The root cause of the first observation is that the most major optimization technique during tuning is loop transformation, e.g., loop tiling and fusion in Section III-B. Thus the tuning complexity is directly determined by the loop nest in the program for the operator. This further involves two folds: 1) the number of loops (e.g., seven nested loops in a 2-d convolution); 2) the extent of each loop. Thus, we define a fine-grained weight for each operator as follows, which measures the tuning complexity as a linear function of the loop nest:

wv=c×lLvlog(sl)+b,w_{v}=c\times\prod_{l\in L_{v}}{\log(s_{l})}+b\,, (1)

where LvL_{v} is the set of loops for the operator vv, sls_{l} is the extent of the loop lLvl\in L_{v}, while cc is the slope and bb is the bias. With (1), it is easy to see a larger weight indicates higher complexity of tuning. Then, according to the second observation, the weight of a subgraph SiS_{i} can be derived as the sum of the weights of all operators in SiS_{i}. As illustrated by the black dash line in Fig. 8, we can almost perfectly fit the tuning budget with Eq. 1. Subsequently, we are able to guarantee a tractable size for each subgraph by setting up a threshold as the maximum weight. Moreover, such design helps eliminating trivial subgraphs that may waste tuning budgets and yielding balance among all subgraphs.

IV-B Acyclic Partitioning

Refer to caption
Figure 9: Cyclic dependency with complicated subgraphs.

After calculating weights, we here propose a new algorithm to address the issue of cyclic dependency. To allow arbitrary subgraph structures, the graph frontend can incur cycles in the resulting partition unexpectedly. For example, suppose Conv1Conv1 and Conv3Conv3 in Fig. 9 can trigger the intensive fusion. Then we put them in the same subgraph S1S_{1}, while Conv2Conv2 constitutes another subgraph S2S_{2}. In this case, S1S_{1} and S2S_{2} have inter-dependency in input tensors and output tensors. Such cycles can lead to deadlocks when executing these subgraphs at the runtime. Although prior works employ heuristics to produce subgraphs without cycles [5, 6], the generated subgraphs are over simplified and many opportunities are thus excluded. For example, three complex operators in Fig. 9 will be placed into three separate subgraphs [5, 6], hence missing opportunities of intensive fusion and joint optimization.

We first formally define the acyclic property for a graph partition. We call a partition without cycles a nn-way acyclic partition, if it satisfies the following property.

Definition 1.

nn-way acyclic partition: A nn-way acyclic partition contains nn disjoint sets of nodes {V1,V2,Vn}\{V_{1},V_{2},...V_{n}\}. For any u,uViu,u^{\prime}\in V_{i}, v,vVjv,v^{\prime}\in V_{j}, 1ijn1\leq i\neq j\leq n, there cannot exist two paths, from uu to vv and from vv^{\prime} to uu^{\prime}, at the same time.

Further, we introduce the concept of topological stage as an identifier to the topological position of each node in GG.

Definition 2.

topological stage: The topological stage tsv1ts_{v}\geq 1 is an integer denoting the position of vv in GG. It can be calculated as the length of the longest path from the root rr (a node with zero in-degree) to the current vv.

It is easy to see, for any node vVv\in V, e=(u,v)E\forall e=(u,v)\in E, we have tsu<tsvts_{u}<ts_{v}; e=(v,w)E\forall e=(v,w)\in E, we have tsw>tsvts_{w}>ts_{v}.

Based on the concept of topological stage, we observe that for each node vv, there exists a set of nodes that can be safely grouped with it. We call such a special set affix set.

Definition 3.

affix set: We first denote the set of nodes that vv can connect to in the underlying undirected graph corresponding to GG as UCvUC_{v}. Then, the affix set ASvAS_{v} for node vv is a subset of UCvUC_{v}, such that each node in ASv{AS}_{v} satisfies one of the following two conditions:

uASv,\displaystyle\forall{u}\in{AS}_{v}, tsu=tsv+1;\displaystyle\,{ts}_{u}={ts}_{v}+1;
uASv,\displaystyle\forall{u}\in{AS}_{v}, tsu=tsv1.\displaystyle\,{ts}_{u}={ts}_{v}-1.\vspace{-0.1in}

With the above definitions, we can derive the following theorem, which is the core of our later graph partitioning algorithm to guarantee the acyclic property.

Theorem 1.

Given a node vv and its affix set ASv{AS}_{v}, there will exist no cycles in the resulting graph partition if vv and any nodes in ASv{AS}_{v} cluster together to produce a new subgraph.

Proof.

We will prove the theorem by deducing a contradiction. Specifically, for any two nodes uu and vv in GG, assume that grouping uu and vv together will produce a cycle in the partition. Since the original graph GG is acyclic, the generated cycle indicates that there must exist a path upvu\rightarrow p\rightarrow v in GG, where pp is another node. For example, in Fig. 9, the Conv1,Conv2,Conv3Conv1,Conv2,Conv3 operators are u,p,vu,p,v, respectively. However, we only group uu and vv if uASvu\in{AS}_{v} or vASuv\in{AS}_{u}. Thus, we have |tsvtsu|=1|ts_{v}-ts_{u}|=1, which means the path from uu to vv or vv to uu must have no other nodes, i.e., pp does not exist. In summary, no cycle will be generated if vv and a node uASvu\in{AS}_{v} cluster together to generate a new subgraph. ∎

Algorithm 1 Graph Partition
1:function Cluster(GG)
2:     Initialize Td,Cand,TopStageTd,Cand,TopStage
3:     Calculate weight for each vv in GG
4:     while CandCand\neq\emptyset do
5:         Choose vCandv\in Cand with heaviest weight
6:         if uASv,s.t.,wv+wu<Td\exists u\in AS_{v},\textit{s.t.},w_{v}+w_{u}<Td then
7:              uu and vv cluster together as a hyper node vv^{\prime}
8:              Move vv^{\prime} to CandCand
9:         else
10:              Remove vv from CandCand
11:         end if
12:         Update E,TopStageE,TopStage
13:     end while
14:     Translate hyper nodes into subgraphs {S1,S2,}\{S_{1},S_{2},...\}
15:     return {S1,S2,}\{S_{1},S_{2},...\}
16:end function

Based on Theorem 1, we can derive our cycle-free graph partitioning algorithm, in which affix operators iteratively cluster together to yield subgraphs. We illustrate the Cluster algorithm in Algorithm 1. At 2, we pre-process the graph GG to initialize some data structures, including the maximum weight threshold TdTd for subgraphs, the initial candidate node set CandCand that contains all nodes in GG, and the information on topological stages TopStageTopStage. Then we calculate the weight for each operator at 3. Next, we group affix nodes to generate subgraphs, and the weight of each subgraph is controlled via a greedy strategy (4 - 13). In each iteration, a node vv (a subgraph SiS_{i} can be viewed as a hyper node vv) with the heaviest weight in CandCand is selected (5). We then search in ASvAS_{v} to find a node uu with the smallest weight. If the sum of weights of vv and uu is smaller than the threshold TdTd, a new subgraph will be produced. This subgraph is also viewed as a new hyper node vv^{\prime}, and put in the candidate set for further clustering (7 - 8). Otherwise, the node vv will be skipped and removed from CandCand (10). After each iteration, we will update the edge set EE and the position information TopStageTopStage (12). The algorithm will repeat until the candidate set is empty. In this way, the resulting partition is guaranteed to be acyclic and each subgraph has a reasonable weight that is smaller than TdTd.

V Divide-and-Conquer Tuning in Reformer Layer

The search space of a subgraph with a complicated structure is still large even if we limit its weight during graph partitioning. To achieve efficient tuning, we insert a reformer layer between the graph frontend and the tuner backend to orchestrate different components. The reformer layer exploits a divide-and-conquer mechanism to break down the complicated subgraph tuning into smaller sub-tasks and then combine them. As the dividing stage, we design a Split function. By invoking Cluster in Algorithm 1, the Split function further splits each large subgraph SiS_{i} into several mini-subgraphs {Mi1,Mi2,}\{M_{i1},M_{i2},...\}. Each mini-subgraph has at most one complex operator and a smaller weight. Then, as the conquering stage, we devise a Join function. It can combine those mini-subgraphs {Mi1,Mi2,}\{M_{i1},M_{i2},...\} back to SiS_{i} for further tuning.

The reformer layer will immediately execute the Split function after graph partitioning. Then, by inspecting the feedbacks from the tuner backend, the reformer layer will call the Join function to combine those mini-subgraphs MijM_{ij} back as SiS_{i}, if the tuning for each Mij(1jmi)M_{ij}(1\leq j\leq m_{i}) tends to stabilize. During joining, the schedules searched by the tuner for each mini-subgraph will also be composed as a large schedule for SiS_{i}. When delivering SiS_{i} to the tuner backend, this combined schedule will be treated as the initial schedule to evade inefficient tuning from the scratch for SiS_{i}.

Our cycle-free partitioning function Cluster only limits the maximum weight for each subgraph without other constraints on structures. Further, with the Split and Join functions, AGO is able to optimize any subgraph efficiently.

VI Evaluation

We implemented the graph frontend in C++ based on TVM (0.8dev1) [3], and the reformer layer and the tuner in both Python and C++. In this section, we evaluate AGO on two mobile CPU platforms with Kirin 990 SoC (Android v10), representing high-end devices, and Qualcomm snapdragon (Qsd) 810 SoC (Android v8), representing low-end devices with strict resource constraints. We compare AGO with Torch Mobile [18] and Ansor [9]. Torch Mobile is a widely used deep learning framework, which employs a hand-tuned high-performance library XNNPACK [21] developed by Google. While Ansor is the state-of-the-art auto-tuning framework based on TVM, which performs better than TensorFlow Lite [19, 9]. Thus, TensorFlow Lite is not included in our benchmarks.

VI-A End-to-End Performance

Refer to caption
(a) Input tensor shape (1, 3, 56, 56).
Refer to caption
(b) Input tensor shape (1, 3, 112, 112).
Refer to caption
(c) Input tensor shape (1, 3, 224, 224).
Figure 10: End-to-end inference performance on Qsd 810 SoC.
Refer to caption
(a) Input tensor shape (1, 3, 56, 56).
Refer to caption
(b) Input tensor shape (1, 3, 112, 112).
Refer to caption
(c) Input tensor shape (1, 3, 224, 224).
Figure 11: End-to-end inference performance on Kirin 990 SoC.
Refer to caption
Figure 12: End-to-end inference performance for BT and MVT.

In this subsection, we evaluate AGO in terms of end-to-end inference performance. Our benchmarks cover four classical neural networks: MobileNet-V2 (MBN) [11], MNasNet (MNSN) [12], SqueezeNet (SQN) [13], and ShuffleNet-V2 (SFN) [14], and two emerging new networks: Bert-tiny (BT) [15, 16] and MobileViT (MVT) [17]. These networks are lightweight and widely used for mobile deep learning services. For both Kirin 990 SoC and Qsd 810 SoC, we set the batch size to 1 for all input tensors due to constrained computing power, which is also a general setting for mobile inference. Besides, we test different shapes of the input tensor for each classical network: small shape (NN=1, II=3, HH=56, WW=56), middle shape (NN=1, II=3, HH=112, WW=112), and large shape (NN=1, II=3, HH=224, WW=224). These shapes represent various workloads due to divergent image resolutions in real applications. For the new language model BT, we set the input sequence length to 128, which is the longest sequence it supports. For the new model MVT, we only evaluate it on the large shape, which is the image size of the Imagenet dataset [22]. All networks are executed with float32 precision. Besides, we set the search budget for AGO and Ansor to 20,000, which is suggested by Ansor [9] for sufficient tuning.

We report the speedup of each method over Torch Mobile for classical networks in Fig. 10 and Fig. 11, where the number on the top of each bar is the raw latency in milliseconds. On the Qsd 810 SoC, AGO achieves an average speedup of 1.5×1.5\times, 1.6×1.6\times, and 1.8×1.8\times over Torch Mobile on three input tensor shapes respectively. Compared with Ansor, AGO achieves an average speedup of 1.2×1.2\times on each input shape. The main reason behind the significant improvement over Torch Mobile is that, hand-tuned libraries often put tremendous engineering efforts on optimizing typical workloads, while other non-typical operators are less optimized. The speedup over Ansor mainly originates from the intensive fusion and joint optimization for complicated subgraphs, which are the opportunities missed by Ansor. For example, when there are many subgraphs with consecutive pointwise and depthwise convolutions, AGO achieves an average of 1.3×1.3\times speedup over Ansor.

Similarly, on the Kirin 990 SoC, AGO achieves average 1.9×1.9\times, 2.1×2.1\times, and 1.5×1.5\times speedup over Torch Mobile on three input tensor shapes respectively. Compared with Ansor, AGO achieves average 2.6×2.6\times, 1.6×1.6\times, and 1.1×1.1\times speedup respectively. Again, such improvements are directly owing to our intensive fusion and joint optimization. By contrast, Ansor suffers from a limited tuning space due to the simple subgraph structures generated by Relay [5]. For example, AGO outperforms both baselines on MNSN significantly, which involves massive pointwise and depthwise convolutions. Both Torch Mobile and Ansor can only perform conventional fusion, while AGO can achieve either intensive fusion or joint optimization.

We further employ AGO to optimize BT and MVT, which are two new networks. We report the results in Fig. 12, where we do not test MVT on the Qsd 810 SoC due to its limited resources. Compared with Torch Mobile, AGO improves the performance by 38.2%38.2\% on BT and 34.3%34.3\% on MVT, respectively. Further, compared with Ansor, AGO improves the performance by 20.5%20.5\% on BT and 29.1%29.1\% on MVT. In summary, AGO can be used to boost new neural architectures readily without any interference.

Additionally, the tuning budget of 20,000 implies up to a day of the compilation time. But this is affordable to practitioners since they only need to execute AGO once before the long-run deployment. Moreover, it is much shorter than weeks or even months of hand-tuning.

VI-B Micro Benchmark

Refer to caption
(a) Two depthwise convolutions.
Refer to caption
(b) Depthwise conv & pointwise conv.
Refer to caption
(c) Pointwise conv & depthwise conv.
Refer to caption
(d) Two pointwise convolutions.
Figure 13: Subgraph performance with different variants of AGO.

In this subsection, we further study where the performance gain of AGO comes from, to evaluate our intensive fusion and reformer layer. Then, we will evaluate our graph partitioning algorithm, by respectively inspecting the generated subgraphs partitioned by our algorithm and Relay [5].

We first compare three variants of AGO to break down the improvements: 1) AGO-NI (no intensive fusion in the tuner backend); 2) AGO-NR (no reformer layer, i.e., tuning a large subgraph directly); 3) AGO, same as Section VI-A, as the baseline. We then evaluate them on four subgraphs. Each subgraph consists of two complex operators and some other simple operators. The complex operator is either pointwise convolution or depthwise convolution. Except the subgraph with two depthwise convolutions, other subgraphs are extracted from MBN and MNSN. Additionally, the tuning budget is 2,000 for each variant and subgraph.

We present the results in Fig. 13, where the number behind BB is the batch size. AGO-NI has average 17%17\% performance loss compared with AGO on two platforms. The reason is that AGO can fuse multiple complex operators to further improve the performance, while AGO-NI only optimizes them jointly with conventional fusion. Further, AGO-NR has nearly 27% performance loss. This is because directly optimizing a complicated subgraph is hard, while AGO addresses this issue through the divide-and-conquer tuning mechanism. We also observe that there are some cases where AGO-NI outperforms AGO Fig. 13(d). This indicates that AGO cannot find better schedules due to the increased search space size after joining mini-subgraphs, hence inefficient budget usage in such workloads. Thus, this issue can be addressed by prolonging the tuning for mini-subgraphs before joining.

Refer to caption
Figure 14: Subgraph weight distribution for MVT.

Next, we evaluate our graph partitioning algorithm. We present the subgraph weight distribution for MVT in Fig. 14. We construct ten weight bins in log scale (e.g., bin [1,2)[1,2) means weight interval [21,22)[2^{1},2^{2})), and report the number of subgraphs in each bin. The new model MVT integrates attention modules, yielding a large number of reshape and transpose operators. Relay will heuristically take such operators as delimiters to produce totally 259 subgraphs, where 105 of them are trivial and have a weight less than 20. Besides, the average weight, the median weight, and the Jain’s fairness index (measuring balance and higher is better) are 138, 23, and 0.19, respectively. In contrast, AGO generates 82 subgraphs. Most of them have a large weight as shown in Fig. 14. For AGO, the average weight, the median weight, and the Jain index are 437, 350, and 0.55, respectively. Therefore, our partitioning algorithm can generate more complicated subgraphs while maintaining balance. Take a typical structure in MVT as an example, which contains eight consecutive operators: matrix multiplication, reshape, add, reshape, transpose, reshape, matrix multiplication, and reshape. Relay produces five fragmented subgraphs for this structure, missing opportunities of intensive fusion for two matrix multiplications and joint optimization for all simple operators. Such partition leads to inferior performance since the reshape/transpose operators involve expensive memory loads/stores. By contrast, AGO typically groups such operators together to boost the performance.

VII Related Work

Deep learning compiler: Various deep compilers have been proposed to address the error-prone manual optimization [2, 3, 7, 23, 24, 8, 9, 25, 26, 27, 28, 29, 30, 10, 31, 32]. Tensor Comprehension [23] and TVM [3] adopt the idea of decoupling optimization from operator description to simplify the auto-tuning process, which is then provided by AutoTVM [7], FlexTensor [8], Ansor [9], ALT [32] via search algorithms. TASO [24], Tensat [30], and PET [31] perform graph substitutions to generate more efficient graphs. To speed up the tuning, AKG [10] applies the polyhedron model, while delicate cost models [26, 27] and heuristics [28, 33] are also proposed. Compared with AGO, 1) their tuners do not support intensive fusion; 2) their frontends take the tuner as a black box and no cross-layer mechanism is involved; 3) their graph partitioning algorithms impose unnecessary constraints on subgraph structures, thus can only generate simple subgraphs.

Operator fusion: Many systems exploit operator fusion as an important optimization technique [25, 34, 6, 35, 10, 36]. In general, they can fuse a complex operator with its following simple operators, which is named as conventional fusion in this work. For instance, [36] can fuse memory-bound operators for NVIDIA GPU based on XLA [2]. [10, 6] exploits the polyhedral model to explore the fusion opportunity. Although two matrix multiplications can also be fused on NVIDIA GPU in Bolt [35], it is implemented based on the vendor library CUTLASS [37] and the fact that the cache (shared memory) in NVIDIA GPU is programmable. Thus, Bolt can only fuse matrix multiplications that CUTLASS supports, while cannot fuse general complex operators on CPUs. Compared with these works, AGO enables generic auto-tuned intensive fusion on mobile devices, while guaranteeing efficiency via careful analysis on computation redundancy.

Computational graph partitioning: Classical graph partitioning has been extensively studied [38], typically in distributed computing area. In deep learning systems [39, 40, 41, 42, 43], they are often used to increase the parallelism to improve performance. For instance, IOS [39] and Unity [43] exploit partitioning to parallelize subgraphs on NVIDIA GPU. [40] partitions the graph to reduce the peak runtime memory footprint. SPINN [41], CLIO [42], and Walle [44] partition a computational graph into two parts, with one part running on the device and the other running on the cloud/edge server. Compared with these systems, the major goal of our graph partitioning is to improve mobile inference performance via compilation techniques while keeping acyclic theoretically.

VIII Conclusion

We propose AGO, a framework that removes the constraints on graph optimization to boost the inference performance of AI models on mobile devices. AGO provides a new partitioning scheme to generate arbitrary subgraphs while keeping acyclic. It also designs a potent tuner which proposes intensive operator fusion and joint optimization to boost arbitrary subgraphs. Additionally, AGO devises a divide-and-conquer mechanism to address the tuning efficiency. Experiments show that AGO significantly outperforms state-of-the-art hand-tuned libraries and makes great progress over auto-tuning frameworks. For researchers, they can use AGO to further explore the performance characteristics of complicated subgraphs. For practitioners, they can easily exploit AGO to improve the inference performance without human interference.

IX Acknowledgement

We thank the anonymous reviewers of INFOCOM 2023 for their valuable comments. This work is partially supported by the National Natural Science Foundation of China under Grant Number 62272213 and the Jiangsu Innovation and Entrepreneurship (Shuangchuang) Program.

References

  • [1] X. Jiang, H. Wang, Y. Chen, Z. Wu, L. Wang, B. Zou, Y. Yang, Z. Cui, Y. Cai, T. Yu, C. Lv, and Z. Wu, “MNN: A universal and efficient inference engine,” in Proceedings of MLSys, 2020.
  • [2] C. Leary and T. Wang, “Xla: Tensorflow, compiled,” TensorFlow Dev Summit, 2017.
  • [3] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze et al., “TVM: An automated end-to-end optimizing compiler for deep learning,” in Proceeding of USENIX OSDI, 2018, pp. 578–594.
  • [4] R. Baghdadi, J. Ray, M. B. Romdhane, E. Del Sozzo, A. Akkas, Y. Zhang, P. Suriana, S. Kamil, and S. Amarasinghe, “Tiramisu: A polyhedral compiler for expressing fast and portable code,” in Proceedings of IEEE/ACM CGO, 2019, pp. 193–205.
  • [5] J. Roesch, S. Lyubomirsky, L. Weber, J. Pollock, M. Kirisame, T. Chen, and Z. Tatlock, “Relay: A new IR for machine learning frameworks,” in Proceedings of ACM MAPL, 2018, pp. 58–68.
  • [6] J. Zhao, X. Gao, R. Xia, Z. Zhang, D. Chen, L. Chen, R. Zhang, Z. Geng, B. Cheng, and X. Jin, “Apollo: Automatic partition-based operator fusion through layer by layer optimization,” in Proceedings of MLSys, 2022.
  • [7] T. Chen, L. Zheng, E. Yan, Z. Jiang, T. Moreau, L. Ceze, C. Guestrin, and A. Krishnamurthy, “Learning to optimize tensor programs,” in Proceedings of NeurIPS, 2018, pp. 3389–3400.
  • [8] S. Zheng, Y. Liang, S. Wang, R. Chen, and K. Sheng, “Flextensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system,” in Proceedings of ACM ASPLOS, 2020, pp. 859–873.
  • [9] L. Zheng, C. Jia, M. Sun, Z. Wu, C. H. Yu, A. Haj-Ali, Y. Wang, J. Yang, D. Zhuo, K. Sen et al., “Ansor: generating high-performance tensor programs for deep learning,” in Proceedings of USENIX OSDI, 2020, pp. 863–879.
  • [10] Z. Jie, Li, Bojie, N. Wang, G. Zhen, Z. Renwei, G. Xiong, C. Bin, W. Chen, C. Yun, L. Zheng, D. Peng, Z. Kun, and J. Xuefeng, “Akg: automatic kernel generation for neural processing units using polyhedral transformations,” in Proceedings of ACM PLDI, 06 2021, pp. 1233–1248.
  • [11] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” 2017.
  • [12] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, “MnasNet: Platform-aware neural architecture search for mobile,” in Proceedings of IEEE/CVF CVPR, June 2019.
  • [13] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5mb model size,” 2016.
  • [14] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “Shufflenet v2: Practical guidelines for efficient cnn architecture design,” in Proceedings of ECCV, 2018, pp. 116–131.
  • [15] I. Turc, M. Chang, K. Lee, and K. Toutanova, “Well-read students learn better: The impact of student initialization on knowledge distillation,” CoRR, vol. abs/1908.08962, 2019.
  • [16] P. Bhargava, A. Drozd, and A. Rogers, “Generalization in nli: Ways (not) to go beyond simple heuristics,” 2021.
  • [17] S. Mehta and M. Rastegari, “Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer,” in Proceedings of ICLR, 2022.
  • [18] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” in Proceedings of NeurIPS, 2019, pp. 8026–8037.
  • [19] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning,” in Proceedings of USENIX OSDI, 2016, pp. 265–283.
  • [20] M. Wang, S. Ding, T. Cao, Y. Liu, and F. Xu, “Asymo: scalable and efficient deep-learning inference on asymmetric mobile cpus,” in Proceedings of ACM MobiCom, 2021, pp. 215–228.
  • [21] Google, “Xnnpack: Highly optimized library of floating-point neural network inference operators for arm, webassembly, and x86 platforms,” 2021. [Online]. Available: https://github.com/google/XNNPACK
  • [22] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proceedings of CVPR.   Ieee, 2009, pp. 248–255.
  • [23] N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, “Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions,” arXiv preprint arXiv:1802.04730, 2018.
  • [24] Z. Jia, O. Padon, J. Thomas, T. Warszawski, M. Zaharia, and A. Aiken, “Taso: optimizing deep learning computation with automatic generation of graph substitutions,” in Proceedings of SOSP, 2019, pp. 47–62.
  • [25] Z. Zheng, P. Zhao, G. Long, F. Zhu, K. Zhu, W. Zhao, L. Diao, J. Yang, and W. Lin, “Fusionstitching: boosting memory intensive computations for deep learning workloads,” arXiv preprint arXiv:2009.10924, 2020.
  • [26] Y. Wang, X. Zhou, Y. Wang, R. Li, Y. Wu, and V. Sharma, “Tuna: A static analysis approach to optimizing deep neural networks,” arXiv preprint arXiv:2104.14641, 2021.
  • [27] R. Li, Y. Xu, A. Sukumaran-Rajam, A. Rountev, and P. Sadayappan, “Analytical characterization and design space exploration for optimization of cnns,” in Proceedings of ACM ASPLOS, 2021, pp. 928–942.
  • [28] B. Steiner, C. Cummins, H. He, and H. Leather, “Value learning for throughput optimization of deep learning workloads,” Proceedings of MLSys, vol. 3, 2021.
  • [29] R. Baghdadi, M. Merouani, M.-H. Leghettas, K. Abdous, T. Arbaoui, K. Benatchba et al., “A deep learning based cost model for automatic code optimization,” Proceedings of MLSys, vol. 3, 2021.
  • [30] Y. Yang, P. Phothilimthana, Y. Wang, M. Willsey, S. Roy, and J. Pienaar, “Equality saturation for tensor graph superoptimization,” Proceedings of MLSys, vol. 3, 2021.
  • [31] H. Wang, J. Zhai, M. Gao, Z. Ma, S. Tang, L. Zheng, Y. Li, K. Rong, Y. Chen, and Z. Jia, “PET: Optimizing tensor programs with partially equivalent transformations and automated corrections,” in Proceedings of USENIX OSDI.   USENIX Association, Jul. 2021, pp. 37–54.
  • [32] Z. Xu, J. Xu, H. Peng, W. Wang, X. Wang, H. Wan, H. Dai, Y. Xu, H. Cheng, K. Wang et al., “Alt: Breaking the wall between graph and operator level optimizations for deep learning compilation,” arXiv preprint arXiv:2210.12415, 2022.
  • [33] H. Zhu, R. Wu, Y. Diao, S. Ke, H. Li, C. Zhang, J. Xue, L. Ma, Y. Xia, W. Cui et al., “{\{ROLLER}\}: Fast and efficient tensor compilation for deep learning,” in Proceedings of USENIX OSDI, 2022, pp. 233–248.
  • [34] W. Niu, J. Guan, Y. Wang, G. Agrawal, and B. Ren, “Dnnfusion: Accelerating deep neural networks execution with advanced operator fusion,” in Proceedings of ACM PLDI, 2021, p. 883–898.
  • [35] J. Xing, L. Wang, S. Zhang, J. Chen, and Y. Zhu, “Bolt: Bridging the gap between auto-tuners and hardware-native performance,” arXiv e-prints, 2021.
  • [36] Z. Zheng, X. Yang, P. Zhao, G. Long, K. Zhu, F. Zhu, W. Zhao, X. Liu, J. Yang, J. Zhai et al., “Astitch: enabling a new multi-dimensional optimization space for memory-intensive ml training and inference on modern simt architectures,” in Proceedings of ACM ASPLOS, 2022, pp. 359–373.
  • [37] Nvidia, “Nvidia/cutlass: Cuda templates for linear algebra subroutines.” [Online]. Available: https://github.com/NVIDIA/cutlass
  • [38] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in Proceedings of MOD, 2010, pp. 135–146.
  • [39] Y. Ding, L. Zhu, Z. Jia, G. Pekhimenko, and S. Han, “Ios: Inter-operator scheduler for cnn acceleration,” 2021.
  • [40] B. H. Ahn, J. Lee, J. M. Lin, H.-P. Cheng, J. Hou, and H. Esmaeilzadeh, “Ordering chaos: Memory-aware scheduling of irregularly wired neural networks for edge devices,” 2020.
  • [41] S. Laskaridis, S. I. Venieris, M. Almeida, I. Leontiadis, and N. D. Lane, “Spinn: synergistic progressive inference of neural networks over device and cloud,” in Proceedings of ACM MobiCom, 2020, pp. 1–15.
  • [42] J. Huang, C. Samplawski, D. Ganesan, B. Marlin, and H. Kwon, “Clio: Enabling automatic compilation of deep learning pipelines across iot and cloud,” in Proceedings of ACM MobiCom, 2020, pp. 1–12.
  • [43] C. Unger, Z. Jia, W. Wu, S. Lin, M. Baines, C. E. Q. Narvaez, V. Ramakrishnaiah, N. Prajapati, P. McCormick, J. Mohd-Yusof, X. Luo, D. Mudigere, J. Park, M. Smelyanskiy, and A. Aiken, “Unity: Accelerating DNN training through joint optimization of algebraic transformations and parallelization,” in Proceedings of USENIX OSDI.   Carlsbad, CA: USENIX Association, Jul. 2022, pp. 267–284.
  • [44] C. Lv, C. Niu, R. Gu, X. Jiang, Z. Wang, B. Liu, Z. Wu, Q. Yao, C. Huang, P. Huang, T. Huang, H. Shu, J. Song, B. Zou, P. Lan, G. Xu, F. Wu, S. Tang, F. Wu, and G. Chen, “Walle: An End-to-End, General-Purpose, and Large-Scale production system for Device-Cloud collaborative machine learning,” in Proceedings of USENIX OSDI.   Carlsbad, CA: USENIX Association, Jul. 2022, pp. 249–265.