
MLGOPerf: An ML Guided Inliner to Optimize Performance

Amir H. Ashouri Mostafa Elhoushi Yuzhe Hua Xiang Wang Muhammad Asif Manzoor Bryan Chan Yaoqing Gao Huawei Technologies, Heterogeneous Compiler Lab
Toronto, Canada
Abstract.

For the past 25 years, we have witnessed extensive application of Machine Learning to the compiler space, most notably to the optimization selection and phase-ordering problems. However, few of these works have been upstreamed into state-of-the-art compilers, i.e., LLVM, so that the learned models are seamlessly integrated into the optimization pipeline of a compiler and readily deployable by the user. MLGO was among the first such projects, and it strives only to reduce the code size of a binary with an ML-based Inliner trained using Reinforcement Learning.

This paper presents MLGOPerf, the first end-to-end framework capable of optimizing performance using LLVM’s ML-Inliner. It employs a secondary ML model to generate the rewards used for training a retargeted Reinforcement Learning agent, previously used as the primary model by MLGO. The secondary model predicts the post-inlining speedup of a function under analysis and thereby enables a fast training framework for the primary model that would otherwise be impractical. The experimental results show that MLGOPerf is able to gain up to 1.8% and 2.2% with respect to LLVM’s optimization at O3 when trained for performance on the SPEC CPU2006 and Cbench benchmarks, respectively. Furthermore, the proposed approach provides up to 26% more opportunities to autotune code regions for our benchmarks, which can be translated into an additional 3.7% speedup.

Compilers, Reinforcement Learning, Deep Learning, LLVM, Inlining
*Corresponding author: [email protected]
conference: CASES ’22: International Conference on Compilers, Architecture, and Synthesis of Embedded Systems (CASES); October 9–14, 2022; Shanghai, China
CCS Concepts: Software and its engineering → Compilers; Computing methodologies → Machine learning
The short version of this work is accepted at ACM/IEEE CASES’22 – https://esweek.org/cases

1. Introduction

Compilers have come a long way in emitting efficient and high-performance code from high-level programming languages. Breakthroughs in compiler technology are essential to making programming mainstream (Hall et al., 2009). State-of-the-art compiler frameworks such as LLVM and GCC are able to translate high-level programming languages into an (almost) architecture-independent Intermediate Representation (IR), on which a sequence of optimization passes can be applied to iteratively optimize a code segment. These optimization passes can be applied at the granularity of the Function, Call Graph, Loop, or Module level (Lattner, 2008). Over time and through experimentation, compiler designers have come up with a set of fixed-length optimization pipelines, known as standard optimization levels, i.e., -O{1,2,3,s,z}, that on average have shown benefits for optimizing performance (speed) or code size. For instance, LLVM’s O3 has around 160 passes in its pipeline, performing a number of optimizations at different levels of granularity, of which around 60 passes are unique and several sub-sequences of passes are repeated in the optimization sequence (Ashouri et al., 2017). The idea behind producing such a standardized optimization sequence is to provide users with an option that performs well on average (Georgiou et al., 2018).

In the past couple of decades, several attempts have been made to specialize the standard optimization levels for a given code segment of interest, as they are not always the optimal solution (Agakov et al., 2006; Chen et al., 2012; Park et al., 2013; Ashouri et al., 2016, 2017). These approaches rely on applications of Machine Learning (ML) and an automatic tuning methodology, possibly using an autotuner (Ansel et al., 2014; Kalyan et al., 2019), to speed up the search in the vast compiler optimization space (Ashouri et al., 2018a). Although in many cases researchers have achieved meaningful results, these approaches are often limited by the use of external or third-party components outside the compiler framework to derive intelligence for it. Fortunately, we have been witnessing more and more proposed works tackling the above problems with end-to-end frameworks (Cummins et al., 2017; Haj-Ali et al., 2020; Das et al., 2020; Rotem and Cummins, 2021; Mannarswamy and Das, 2022); however, to the best of our knowledge, there has been only a single recent work, namely MLGO (Trofin et al., 2021), which does so by refactoring an optimization pass, i.e., function inlining, to bring ML-based guidance into the pass as a standalone component. MLGO has also been upstreamed into LLVM trunk (https://lists.llvm.org/pipermail/llvm-dev/2020-April/140763.html).

Function inlining is one of the fundamental compiler optimizations used by many state-of-the-art compiler frameworks; when applied to a function, it examines each call site within the function and decides whether or not to inline the function body of the callee into the caller. Not only does inlining eliminate the function call overhead at the call site, it also expands the scope of intra-procedural analyses for subsequent passes and enables additional optimizations (Scheifler, 1977; Theodoridis et al., 2022). When performed incorrectly, however, inlining can increase code size, which can lead to performance loss due to instruction cache misses. Therefore, an ideal inlining heuristic must either improve performance without incurring unacceptable code bloat, or reduce code size while avoiding substantial performance loss. Constructing an ideal inlining heuristic has been shown to be a complex problem, and the number of inlining choices grows at least exponentially with the number of call sites (Ashouri et al., 2018a; Theodoridis et al., 2022).

Presently, MLGO provides code size reduction for LLVM’s inline optimization, and we base our proposed work on adapting that framework to optimize for performance instead. Our work, MLGOPerf, delivers this contribution to the community by leveraging two ML models to optimize function inlining for speed. MLGOPerf increases the number of tunable code regions subsequent to the inline pass under O3, and we show in the experimental results that it provides an added speedup with respect to both MLGO and the vanilla inliner. MLGOPerf makes the following contributions:

  (1) We propose an ML model, namely IR2Perf, to predict the post-inlining speedup of a caller function with respect to its baseline. IR2Perf leverages a number of handcrafted features from LLVM IR and correlates the speedup outcome with the changes in IR caused by function inlining. The model enables us to rapidly generate the Rewards needed to train the existing Reinforcement Learning (RL) agent used by LLVM’s ML-Inliner infrastructure, without the need to execute each program thousands of times.

  (2) We add two extra features to better capture the behavior of callee functions and to boost the accuracy of the existing LLVM ML-Inliner RL agent.

  (3) At model training, we utilize the newly generated rewards from IR2Perf and revise the RL agent to optimize for performance rather than code size. Finally, at model deployment, we provide a pretrained model that is built with the LLVM CMake system and used by LLVM’s ML-Inliner for inference, without the need for IR2Perf.

The rest of the paper is organized as follows. Section 2 discusses the state of the art. Section 3 provides details on our proposed method and on how the two ML models interact with each other, in Sections 3.1 and 3.2, respectively. Section 4 showcases the experimental results. Finally, in Section 5 we discuss the current shortcomings and challenges, and propose future work.

2. Related Work

Compiler autotuning has become an extensively researched area in the past two decades (Ashouri et al., 2018a), more so with the introduction of Machine Learning (ML) in a number of early approaches, e.g., optimization space reduction (Cooper et al., 1999), estimating the unrolling factor (Koseki, 1997), and scheduling (Moss et al., 1998). The field has gradually shifted towards tackling two major problems, both of which remain open, namely (1) the optimization selection problem (Bodin et al., 1998; Agakov et al., 2006; Fursin and Temam, 2009; Park et al., 2013; Ashouri et al., 2016) and (2) the phase-ordering problem (Kulkarni and Cavazos, 2012; Ashouri et al., 2017; Nobre et al., 2018; Cummins et al., 2021b; Wang et al., 2022).

The bulk of these approaches tackle the above problems as black-box optimization: an autotuning strategy (Fursin et al., 2011; Chen et al., 2012; Ashouri et al., 2013; Ansel et al., 2014; Kalyan et al., 2019) is paired with an ML model that drives an iterative compilation methodology (Bodin et al., 1998; Agakov et al., 2006; Hoste and Eeckhout, 2008a; Tiwari et al., 2009; Park et al., 2012; Ashouri et al., 2016, 2017, 2018b) to speed up the search.

Recently, we have also witnessed applications of Deep Learning in this space. For instance, Deep Reinforcement Learning (Sutton and Barto, 2018) has become a popular method. On the optimization side, authors tackle the problem of finding optimal vectorization factors (Haj-Ali et al., 2020), provide a high-level optimization environment for compiler research (Cummins et al., 2021b), and build autotuners (Park et al., 2022). Additionally, by leveraging Natural Language Processing (NLP) (Young et al., 2018), one can generate an embedding, or simply a characterization, of source-level software using representative features (Ben-Nun et al., 2018; Alon et al., 2019; Cummins et al., 2021a). There have also been a number of works addressing the aforementioned problems with end-to-end frameworks, tailored towards bringing autotuning into the compiler (Fursin et al., 2008; Cummins et al., 2017; Haj-Ali et al., 2020; Cummins et al., 2021b; Rotem and Cummins, 2021; Das et al., 2020; Mannarswamy and Das, 2022). These methods still leverage a high-level optimization engine that wraps around the identification of beneficial passes to apply to a given segment of code, and none has managed to propose a standalone optimization pass capable of deriving decisions by inference from an ML model built into the compiler.

Rotem and Cummins (Rotem and Cummins, 2021) propose an ML framework, leveraging decision trees, to provide Profile-Guided Optimization (PGO) hints for branch probabilities. The authors use handcrafted features for both basic blocks and branches to characterize a code segment and employ the XGBoost library (Chen and Guestrin, 2016) to generate their trees. This work is relevant to ours in that its methodology is built into LLVM and, once trained, it can provide PGO inferences without actually executing a number of PGO profiling iterations.

Phothilimthana et al. (Phothilimthana et al., 2021) propose a two-level autotuner to tune graph-level and subgraph-level optimizations across multiple compilation stages. The authors provide a joint optimization methodology for a number of tensor parameters and operations mostly leveraged in production ML compilers, including tile size, tensor layouts, operator fusion, and code generation. This work is orthogonal to our proposed work.

Trofin et al. (Trofin et al., 2021) propose MLGO, which was the first of its kind to provide ML-guided optimization for a pass, i.e., inline optimization (the repository has since added an option for register allocation). The work has been upstreamed into LLVM and provides guidance to the inline optimization as to whether or not a call site should be inlined, such that the generated code size is minimized. Our proposed work, MLGOPerf, is similar to MLGO in the sense that we leverage the inlining infrastructure already upstreamed into the LLVM repository; however, we adapt the framework to derive decisions that optimize performance rather than reduce code size.

Optimizing for execution-time performance is an inherently more difficult problem, and we attempt to do so by proposing a second ML model, i.e., IR2Perf, which is able to predict the speedup of a function after it has been passed through the function inlining optimization. Leveraging IR2Perf, we generate a fitness function, namely Rewards, to train an RL agent that learns the complex behavior of making an inlining decision for a given call site. Recent studies have successfully used semi-supervised reward learning (Konyushkova et al., 2020) and loss as self-supervision (Shelhamer et al., 2016) to guide the learning of an RL agent. We believe a supervised reward-learning process has merit in helping the ML-Inliner agent learn its optimal policy. Once trained, MLGOPerf provides standalone predictions to LLVM’s ML-Inliner and optimizes segments of code so that they (1) run faster and, additionally, (2) contain an enhanced number of inter-procedural opportunities down the pipeline of LLVM’s O3, which may translate into faster code.

3. Proposed Methodology

Recently, the upstream LLVM inliner has benefited from MLGO (Trofin et al., 2021), through which the user can leverage an ML-Inliner pass to derive inlining decisions for call sites. The current ML-Inliner approach targets code size reduction and, although function inlining is not the first candidate among compiler passes for directly increasing the performance of running code (Scheifler, 1977; Prokopec et al., 2019; Theodoridis et al., 2022), we believe there is merit in designing a flow in which the ML-Inliner targets a performance gain when deciding whether or not to inline a call site. An added benefit of this approach is the new opportunities proper function inlining provides for subsequent compiler passes within the O3 pipeline.

There are three major challenges and limitations to achieving such behavior: (1) the current Reinforcement Learning (RL) infrastructure designed to reproduce MLGO (https://github.com/google/ml-compiler-opt) only tackles the reduction of the code size of a function, and we need to adapt the methodology to optimize performance instead; (2) it relies on a set of rewards to guide the training of its policy trajectory (Schulman et al., 2017); and (3) training such an RL agent requires thousands of evaluations or observations, and at each iteration a reward value is needed to learn the profitability of the action. Therefore, using the actual execution time is not a practical solution; this is especially a problem for real-world applications, e.g., SPEC, which take tens of minutes per round of execution. To this end, we design a second ML model, i.e., IR2Perf, to help alleviate the above challenges. Figure 1 depicts a systematic view of the interaction between the two ML models in our work.

Figure 1. MLGOPerf High-level Flow

3.1. IR2Perf to the Rescue

At the granularity of call sites, the ML-Inliner decides whether or not to inline the function body of the callee into the caller. It does so by using the existing inference path under MLInlineAdvisor: InlineAdvisor. It collects a number of features, including CalleeBasicBlockCount, CallSiteHeight, NodeCount, etc., to provide an inference to the advisor. However, in order to train such a model, rewards must be fed in at each iteration of its training pipeline to guide the trajectory of its policy decisions when it takes an action (Trofin et al., 2021). We provide the IR2Perf model at Function-level granularity for the caller, predicting the speedup of the whole function as a proxy to generate rewards for the RL model.

3.2. MLGOPerf Phases

As a result of leveraging two ML models, our proposed work is a multi-phased methodology; thus, in this section we provide insights to better demonstrate the functionality of our approach. Table 1 showcases the specific interaction between the two models within the three phases. Furthermore, Figure 2 depicts the three phases in more detail. We discuss each phase in the following subsections.

Table 1. MLGOPerf Phases
Phases IR2Perf RL Model
IR2Perf Training Training -
RL Model Training Inference Training
MLGOPerf Deployment - Inference
Figure 2. MLGOPerf Three Phases

3.2.1. IR2Perf Training

In this phase, we design an autotuning methodology that controls the inlining decisions made at the granularity of the call sites of a function. We do so by leveraging an autotuner (Kalyan et al., 2019), an OpenTuner (Ansel et al., 2014) derivative. This allows us to generate meaningful training data that captures the behavior of function inlining when a call site is inlined into its respective caller, together with the Function- and Module-level execution times. Additionally, we implement a number of relevant function features as a pass and register it after the inline optimization in the O3 pipeline to collect our training data.

Let ω be a characterization vector of features of a function having at least one call site. This vector comprises l variables accounting for the intermediate representation of the function, collected post-inlining. We feed the collected features, together with the measured execution time and the corresponding inlining configuration, into our ML model. IR2Perf is a deep regression model designed, once trained, to predict the relative speedup of a function under analysis with respect to O3 when an inlining configuration is applied.

3.2.2. RL Model Training

Reinforcement Learning (RL) is a class of Machine Learning that tries to find an optimal policy for a Markov Decision Process (MDP) defined by the tuple <S, A, T, R>, where S is the state space, A is the action space, R is the reward function the agent receives for taking action a that moves it from state s to s′, and T is the transition probability at time t from state s to s′: T_a(s, s′) = Pr(s_{t+1} = s′ | s_t = s, a_t = a) (Sutton and Barto, 2018). The goal of RL training is for the agent to learn an optimal policy that maximizes its reward. In this work, we formulate the tuple as follows (a minimal sketch of this environment is given after the list):

  (1) State S: the call site currently being visited.

  (2) Action A: a boolean variable deciding whether or not to inline the call site.

  (3) Transition T: a deterministic function in our context, which updates the call graph upon taking an action on the current call site and then moves on to visit the next call site.

  (4) Reward R: in this work, defined as the function execution speedup with respect to the baseline.
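To make the formulation concrete, the following is a minimal, self-contained Python sketch of the inlining MDP described above; the call-site features, the fixed list of call sites, and the reward heuristic are all illustrative stand-ins (the reward in particular stands in for the IR2Perf-predicted function speedup), not the actual LLVM or MLGO implementation.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class CallSite:
        callee_size: int   # illustrative feature
        in_loop: bool      # illustrative feature

    @dataclass
    class InliningEnv:
        call_sites: List[CallSite]
        cursor: int = 0
        decisions: List[bool] = field(default_factory=list)

        def state(self) -> CallSite:
            # State S: the call site currently being visited.
            return self.call_sites[self.cursor]

        def step(self, inline: bool):
            # Action A: whether or not to inline the current call site.
            self.decisions.append(inline)
            # Transition T: deterministically move on to the next call site.
            self.cursor += 1
            done = self.cursor == len(self.call_sites)
            # Reward R: function speedup, granted here only at episode end.
            reward = self._proxy_speedup() if done else 0.0
            return done, reward

        def _proxy_speedup(self) -> float:
            # Toy stand-in for the IR2Perf-predicted function speedup.
            score = 1.0
            for cs, inlined in zip(self.call_sites, self.decisions):
                score *= 1.02 if (inlined and cs.in_loop and cs.callee_size < 50) else 1.0
            return score

    env = InliningEnv([CallSite(30, True), CallSite(200, False), CallSite(10, True)])
    done, reward = False, 0.0
    while not done:
        done, reward = env.step(inline=env.state().callee_size < 50)
    print("episode reward (proxy function speedup):", reward)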

As stated in MLGO (Trofin et al., 2021), training the original method for speed is difficult due to the runtime measurement of a large body of benchmarks, and it is a more complex problem than training for code size. Unlike a code size reduction objective, performance cannot be decoupled by means of commutative and associative properties in an optimization exploration strategy. Therefore, we address this issue by leveraging IR2Perf. There are a number of methods to train such an RL agent; ours uses the Proximal Policy Optimization (PPO) (Schulman et al., 2017) algorithm. Algorithm 1 provides insight into the training flow, in which we update the policy of our RL agent using Equation 1. Similar to MLGO, we rely on the total reward, i.e., the summation of all the rewards generated using IR2Perf. It is worth noting that, using IR2Perf, we could also generate partial rewards per function at each time t, but we did not follow this method.

Algorithm 1 Training Inliner RL Model using IR2Perf
procedure FunctionSpeedup            ▷ IR2Perf Inference
    for Function f in Module do
        FTs ← getFunctionFeatures()
        funcReward ← infer(FTs)
        totalReward ← append(funcReward)
    end for
    return totalReward
end procedure

procedure CallsiteInline             ▷ RL Model Training
    initialize policy π_γ randomly
    for iteration i in Training do
        s_i ← Sample_{N(0,I)}(TrainingData)
        Compile and get IR with policy π_{γ+σ·s_i}
        R ← FunctionSpeedup(Module)
        Update policy γ using Equation 1
    end for
end procedure

(1)   γ = γ + α · (1/n) · Σ_{i=1}^{n} [ Σ_{t=0}^{t_n} R_{total_i} · ∇_γ log π_γ(a_{i,t} | s_{i,t}) ]

where α is the learning rate at which the policy parameters γ are updated. The RL agent tries to maximize the total reward it receives from IR2Perf when applying an action a_{i,t} in state s_{i,t}.
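As an illustration of how Equation 1 updates the policy parameters, the following is a minimal REINFORCE-style sketch in NumPy, assuming a simple logistic inline/do-not-inline policy over synthetic feature vectors; the actual training uses PPO (Schulman et al., 2017), which additionally clips the policy update and uses a learned baseline.

    import numpy as np

    rng = np.random.default_rng(0)

    def policy_prob(gamma, s):
        # pi_gamma(a = inline | s): logistic policy over a feature vector s.
        return 1.0 / (1.0 + np.exp(-s @ gamma))

    def rollout(gamma, num_callsites=5, dim=4):
        # One episode: sample an inline decision per call site and return
        # (states, actions, total reward). The reward is a synthetic stand-in
        # for the summed IR2Perf-predicted speedups.
        states, actions = [], []
        for _ in range(num_callsites):
            s = rng.normal(size=dim)
            a = rng.random() < policy_prob(gamma, s)
            states.append(s)
            actions.append(a)
        total_reward = 1.0 + 0.05 * sum(a * np.tanh(s.sum()) for s, a in zip(states, actions))
        return states, actions, total_reward

    def update(gamma, alpha=0.1, n=16):
        # One step of Equation 1: average the reward-weighted score-function
        # gradient over n episodes and take a gradient-ascent step.
        grad = np.zeros_like(gamma)
        for _ in range(n):
            states, actions, R = rollout(gamma)
            for s, a in zip(states, actions):
                p = policy_prob(gamma, s)
                grad += R * (a - p) * s   # grad_gamma log pi_gamma(a | s)
        return gamma + alpha * grad / n

    gamma = np.zeros(4)
    for _ in range(50):
        gamma = update(gamma)
    print("learned policy parameters:", gamma)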

3.2.3. MLGOPerf Deployment

After our RL model has been trained for sufficient iterations (Machado et al., 2018), we unplug IR2Perf and build LLVM with the pretrained RL model. The point at which we stop the training can be determined by occasional evaluation of the RL model’s performance, by observing the trajectory of the loss function, or by means of a training budget constraint, i.e., a number of iterations or an allocated time. This step is similar to the Release Mode step in MLGO (Trofin et al., 2021).

4. Experimental Results

In this section, we provide the experimental results of our proposed work. From the perspective of compiler engineers, in order to fine-tune the quality of the predictions, one has to reproduce the first two phases stated in Section 3.2; from the user’s perspective, only the third and final phase is of relevance, where we deploy the pretrained model and build it together with the LLVM infrastructure. At deployment, there is no need to leverage either IR2Perf or the RL training pipeline, since IR2Perf is no longer needed to generate rewards, and the RL model can be compiled and built AOT (Ahead Of Time) using the saved_model_cli tool (https://developers.googleblog.com/2017/03/xla-tensorflow-compiled.html).

4.1. IR2Perf Model

As discussed in Section 3.2.1, we develop a pass and register it after the inline optimization in the O3 pipeline to collect the handcrafted features corresponding to a function of interest. In recent years, a number of approaches (Ben-Nun et al., 2018; Alon et al., 2019; Cummins et al., 2021a) have been proposed to provide an embedding of the Intermediate Representation (IR) of a program, which we could have used; instead, in this work, we decided to directly characterize the space using static IR features, shown in Table 2. The overhead caused by collecting these features is negligible.

We collect a total of 370K different inlining configurations from 10 SPEC CPU2006 benchmarks using our autotuner, tuning the inlining decision at the granularity of call sites. For practicality, we used the train dataset for SPEC CPU2006 as opposed to the ref dataset; this limits the runtime of our training samples to within a minute each on average, with a few, e.g., 403.gcc and 462.libquantum, being on the shorter end of the execution time range. Note that the evaluations in Section 4.4 are still performed with the ref dataset. This data-collection process takes around 5 days. The execution time of a function call can be too short to allow for a reliable measurement, so we exclude speedup values higher than our exclusion threshold; in this work, we use 3x as the exclusion threshold hyperparameter. Additionally, we filter out functions whose recorded overhead is less than 1% as reported by Perf, to avoid noisy estimates in our training data.

One challenge here is measuring the exact CPU time the program spends on each call of a given function. We estimate this as follows:

(2)   Func_runtime = (Total Program Runtime × T_Func) / N_Func

where T_Func is the percentage of time the program spent on this function and N_Func the number of times this function was called during the execution.

Subsequently, we define the function speedup, used as the true label for the IR2Perf model, as:

(3)   Func_speedup = Func_runtime(Base) / Func_runtime(X)

where Func_runtime(Base) is the execution time of the function with no inlining configuration and Func_runtime(X) is the execution time of the same function under inlining configuration X, drawn from the set of all inlining configurations generated by our autotuner. The total program runtime and T_Func are collected with Linux perf, and N_Func is collected with profiling instrumentation and the llvm-profdata tool. In this work, all runtime measurements are done single-threaded, and we use the numactl tool to bind the workloads to a unique CPU core. After each measurement run, we flush the system page cache to avoid any perturbation in the collection of our training data.
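As a small worked illustration of Equations 2 and 3 and of the data-cleaning thresholds described above, consider the following sketch; the profile numbers are made up, and in the real pipeline they come from perf and llvm-profdata output.

    EXCLUSION_THRESHOLD = 3.0   # drop speedups above 3x (unreliably short functions)
    MIN_OVERHEAD = 0.01         # drop functions below 1% of total runtime in perf

    def func_runtime(total_runtime_s, perf_fraction, call_count):
        # Equation 2: estimated CPU time per call of the function.
        return total_runtime_s * perf_fraction / call_count

    def func_speedup(base_runtime, tuned_runtime):
        # Equation 3: baseline (O3) per-call runtime over the tuned per-call runtime.
        return base_runtime / tuned_runtime

    # Baseline build (O3, no forced inlining) vs. one autotuned inlining configuration X.
    base_rt  = func_runtime(total_runtime_s=60.0, perf_fraction=0.08, call_count=10_000)
    tuned_rt = func_runtime(total_runtime_s=55.0, perf_fraction=0.07, call_count=10_000)

    speedup = func_speedup(base_rt, tuned_rt)
    keep = speedup <= EXCLUSION_THRESHOLD and 0.07 >= MIN_OVERHEAD
    print(f"per-call speedup = {speedup:.3f}, kept as a training sample: {keep}")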

The goal is to include diverse and meaningful inlining configurations. Each tuning iteration provides us with a number of data points, depending on the number of call sites, which can be added to our training data. Many of the function characteristics are correlated with one another in complex ways, and with the target metric, i.e., function speedup. Thus, we perform a preprocessing stage to better represent the data for training our model. It is well known that the optimal function inlining problem has an exponentially growing optimization space (Theodoridis et al., 2022; Ashouri et al., 2018a). Therefore, generating sufficient training data that captures as many permutations of combinations of inlining parameters as possible remains a pain point. We apply dimension reduction to the feature space since, as is experimentally observed, the average distance between points grows as the feature space grows. Multidimensional datasets, similar to ours, are subject to data sparsity (James et al., 2013), and to this end we employ Principal Component Analysis (PCA) with 7 principal components to reduce the dimensionality of our feature space while preserving the information in our training data with minimal loss (Deco and Obradovic, 1996).

The preprocessing is done as follows: (1) we compute the global and function speedups of each inlining configuration against its baseline, i.e., O3; (2) we then remove redundant data points from the training data and normalize the dataset; and (3) we apply PCA and, as a standard practice, store the fitted feature-scaling and PCA objects so that test data can be transformed into the same scale determined by the training set used to train IR2Perf.
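A minimal sketch of this preprocessing, assuming the raw feature matrix over the Table 2 features has already been assembled (scikit-learn is used here purely for illustration; the synthetic data and the number of raw features are stand-ins):

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    import joblib

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(1000, 20))   # stand-in for rows of Table 2 features
    X_test = rng.normal(size=(200, 20))

    # Step (2): normalize; step (3): reduce to 7 principal components.
    # Both are fitted on the training split only, to avoid leakage.
    scaler = StandardScaler().fit(X_train)
    pca = PCA(n_components=7).fit(scaler.transform(X_train))

    X_train_7d = pca.transform(scaler.transform(X_train))
    X_test_7d = pca.transform(scaler.transform(X_test))   # same scale as the training set

    # Persist the fitted objects so later (test/deployment) data is transformed identically.
    joblib.dump({"scaler": scaler, "pca": pca}, "ir2perf_preproc.joblib")
    print("variance retained by 7 PCs:", pca.explained_variance_ratio_.sum())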

Table 2. IR2Perf Model Features
No Static Feature Name
1 InstructionPerBlock
2 SuccessorPerBlock
3 AvgNestedLoopLevel
4 InstrPerLoop
5 BlockWithMultipleSuccecorsPerLoop
6 CallsNo
7 IsLocal
8 MaxLoopDepth
9 MaxDomTreeLevel
10 CallerHeight
11 CallUsage
12 IsRecursive
13 NumCallsiteInLoop
14 EntryBlockFreq
15 MaxCallsiteBlockFreq
16 NoOfInstructions=’Ret’
17 NoOfInstructions=’fmul’
18 NoOfInstructions=’fdiv’
19 NoOfInstructions=’fadd’
20 NoOfInstructions=’fsub’

IR2Perf is a regression model with four fully-connected linear layers; the first three are each followed by an activation function. Here, we use Leaky ReLU rather than ReLU, as we experimentally observe a slight benefit in model accuracy.

Table 3. IR2Perf Model Architecture
Layer No Layer (type) Output Shape
1 Linear-FC1 [-1, 1, 128]
Leaky_Relu-1 [-1, 1, 128]
2 Linear-FC2 [-1, 1, 256]
Leaky_Relu-2 [-1, 1, 256]
3 Linear-FC3 [-1, 1, 32]
Leaky_Relu-3 [-1, 1, 32]
4 Linear-FC4 [-1, 1, 1]
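A minimal PyTorch sketch of the Table 3 architecture follows; the 7-dimensional post-PCA input, the default LeakyReLU slope, and the Adam optimizer are assumptions made for illustration, not necessarily the exact training configuration.

    import torch
    import torch.nn as nn

    class IR2Perf(nn.Module):
        # Table 3: four fully-connected layers, Leaky ReLU after the first three,
        # and a single scalar output (the predicted function speedup).
        def __init__(self, in_dim: int = 7):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, 128), nn.LeakyReLU(),   # FC1
                nn.Linear(128, 256),    nn.LeakyReLU(),   # FC2
                nn.Linear(256, 32),     nn.LeakyReLU(),   # FC3
                nn.Linear(32, 1),                         # FC4
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)

    model = IR2Perf()
    loss_fn = nn.MSELoss()                                # MSE loss, as in Section 4.2
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # One illustrative training step on random stand-in data.
    x = torch.randn(64, 7)        # batch of post-PCA feature vectors
    y = torch.ones(64, 1)         # stand-in speedup labels
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    print("MSE after one step:", loss.item())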

4.2. IR2Perf Accuracy

The training is done using leave-one-out cross-validation, in which we exclude one benchmark from the training data and later use it as our test data. This ensures no data leakage occurs. Table 4 presents the accuracy of IR2Perf for each benchmark. In this work, we use the Mean Squared Error (MSE) loss function, and we observe that on average IR2Perf achieves an error rate of 12.8% across the SPEC CPU2006 benchmarks, with a best of 9.1% on 401.bzip2. Figure 3 depicts a number of IR2Perf accuracy evaluations. Specifically, Figure 3a shows the loss function over the first epoch of training; as can be seen, the model converges rapidly within a few batches. Figure 3b shows the training graph when 401.bzip2 is excluded from the training data. Subsequently, Figure 3c shows the prediction accuracy when the model is tested on 401.bzip2. Finally, Figure 3d depicts the prediction accuracy of IR2Perf, i.e., a 19.1% error, when 464.h264ref is excluded from the training set (the training figure is not shown for conciseness). The prediction accuracy of IR2Perf is established by the process above, and we use the pretrained model with the lowest error rate found by means of cross-validation. We chose SPEC as the main suite for data collection as it provides real-world, complex function-level code segments that help IR2Perf characterize the behaviour of inlining.
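The leave-one-out protocol over benchmarks can be sketched as follows; scikit-learn's MLPRegressor and the random per-benchmark datasets stand in for the actual IR2Perf model and the collected SPEC data.

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)
    # Stand-in per-benchmark datasets: {name: (post-PCA features, speedup labels)}.
    benchmarks = {name: (rng.normal(size=(300, 7)), rng.normal(1.0, 0.1, size=300))
                  for name in ["401.bzip2", "456.hmmer", "462.libquantum"]}

    errors = {}
    for held_out in benchmarks:
        X_tr = np.vstack([X for n, (X, y) in benchmarks.items() if n != held_out])
        y_tr = np.hstack([y for n, (X, y) in benchmarks.items() if n != held_out])
        X_te, y_te = benchmarks[held_out]
        model = MLPRegressor(hidden_layer_sizes=(128, 256, 32), max_iter=500,
                             random_state=0).fit(X_tr, y_tr)
        errors[held_out] = mean_squared_error(y_te, model.predict(X_te))  # hold-out MSE

    for name, err in errors.items():
        print(f"{name}: held-out MSE = {err:.4f}")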

Table 4. IR2Perf Cross Validation Accuracy
NO. Benchmark Prediction Error (MSE)
1 401_bzip2 9.1%
2 456_hmmer 9.5%
3 462_libquantum 15.7%
4 464_h264ref 19.1%
5 445_gobmk 17.5%
6 470_lbm 9.8%
7 458_sjeng 13.5%
8 429_mcf 12.2%
9 433_milc 13.9%
10 482_sphinx3 11.2%
Geometric Mean 12.8%
Figure 3. IR2Perf Model Evaluation: (a) training loss (first epoch); (b) training accuracy (401.bzip2 held out); (c) prediction accuracy on the 401.bzip2 hold-out from 3b; (d) prediction accuracy on the 464.h264ref hold-out (training figure not shown).

4.3. MLGOPerf Training

Once sufficient IR2Perf accuracy is established, we infer from it to generate rewards for our RL agent trainer. In this stage, i.e., phase 2 of Figure 2, we deploy IR2Perf into the MLGOPerf::RLTraining pipeline and, given an incoming function in a module, we collect the caller’s function features to feed as input to IR2Perf. This guides the RL agent’s policy as to how well an inlining configuration potentially performs, given its callee function representation, which we also collect and feed into the trainer. In the original version of MLGO (Trofin et al., 2021), the authors propose 11 callee features to be used with the RL trainer (specifically, we use LLVM 12 as of commit 1dad9d42 and the MLGO GitHub repository as of commit dac1b149). We add two more relevant features that we believe have added representational value, namely (1) Block Frequency and (2) Loop Level. Our early results show a benefit of around 1% to the final MLGOPerf model when we add these two handcrafted features to the mix. They are used together with the previously introduced 11 features and characterize the caller function for the RL agent.

We train the RL agent using the default hyperparameters mentioned in the original MLGO work, except that we enable reward normalization as suggested by (Schulman et al., 2017). With rewards generated rapidly using IR2Perf, training the RL agent takes around 4 days using an NVIDIA Tesla P100 PCIe 12GB and an Intel(R) Xeon(R) CPU E7-8890 v4 @ 2.20GHz on Ubuntu 18.04.1 LTS when trained on module samples taken from SPEC CPU2006. Note that the IR2Perf reward generator is not limited to the benchmarks it was trained on or cross-validated with, as it can liberally infer the function speedup of any unseen function when fed only the IR features defined in Table 2; thus, we successfully eliminate the need to collect any runtime metrics, i.e., execution time, and largely accelerate the training process.

4.4. MLGOPerf Deployment

We deploy MLGOPerf in the third and final phase of our proposed work. Similar to MLGO, the deployed model, when compiled AOT, is emitted as a library that can be invoked by feeding it the 13 callee features of any unseen application. Note that at this phase we no longer need IR2Perf, as it has already done its job of helping train the RL agent. The overhead of collecting these callee features is minimal, and the MLInliner framework can leverage the decision made by the RL model to decide whether or not to inline a call site in order to optimize the performance of an application under analysis. Similar to Section 4.1, all runtime measurements are done single-threaded, we use the numactl tool to bind the workloads to a unique CPU core, and we flush the system page cache after every measurement to avoid any perturbation. The experiments are measured on an ARMv8.2-A (Kunpeng 920) architecture @ 2.6GHz on Linux. We run each benchmark five times, flushing the system page cache after every run, and we use a trimmed mean: we drop the minimum and maximum measurements and average the remaining three. Additionally, we collect the variance between the measurements and repeat the measurements for every instance of evaluation having a variance of more than 2%.
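The measurement protocol can be sketched as follows; the benchmark command, the use of time.perf_counter instead of perf, and the exact variance check are simplifications of the actual setup.

    import statistics
    import subprocess
    import time

    VARIANCE_THRESHOLD = 0.02   # re-measure when the relative spread exceeds 2%

    def measure_once(cmd):
        # One timed run; the real setup pins the core with numactl and
        # flushes the system page cache between runs.
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        return time.perf_counter() - start

    def trimmed_mean_runtime(cmd, runs=5):
        # Five runs; drop the minimum and maximum; average the remaining three.
        times = sorted(measure_once(cmd) for _ in range(runs))
        kept = times[1:-1]
        mean = statistics.mean(kept)
        rel_spread = statistics.pstdev(kept) / mean   # crude stand-in for the variance check
        return mean, rel_spread

    # Hypothetical usage with a core-pinned benchmark binary:
    # mean, spread = trimmed_mean_runtime(["numactl", "-C", "0", "./benchmark"])
    # if spread > VARIANCE_THRESHOLD:
    #     mean, spread = trimmed_mean_runtime(["numactl", "-C", "0", "./benchmark"])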

4.5. Performance Improvement

Table 5 shows our experimental results using SPEC CPU2006 with the ref dataset (Table 5a) and Cbench (Table 5b). As can be seen, MLGOPerf on average achieves 1.8% and 2.2% speedup against LLVM’s O3 on SPEC CPU2006 and Cbench, respectively. Additionally, we compare the performance of MLGOPerf with respect to MLGO, and on average it outperforms MLGO by 3.1% and 4.1% on SPEC CPU2006 and Cbench, respectively. The average variance between the runs is measured to be around 0.3% to 0.4% on SPEC and 0.43% to 0.86% on Cbench. As expected, there is a slight increase in the code size of the optimized binaries MLGOPerf generates, measured as 12% and 16% against LLVM’s O3 and MLGO on Cbench, and 17.8% and 23.4% on SPEC CPU2006. MLGOPerf is trained to optimize performance as its objective, and it is reasonable to observe an increase in code size, especially compared with MLGO, which was trained solely to reduce code size.

Table 5. Performance improvement against LLVM’s O3 and MLGO
Benchmark | Speedup wrt O3 | Measured Variance | Speedup wrt MLGO | Measured Variance | Code size increase wrt O3 | Code size increase wrt MLGO
401.bzip2 1.052 0.966 1.072 0.594 1.131 1.248
403.gcc 1.022 0.921 1.054 0.004 1.227 1.411
429.mcf 1.009 1.021 1.031 1.242 1.047 1.077
445.gobmk 1.030 0.249 1.044 0.135 1.097 1.104
456.hmmer 0.997 0.040 1.020 0.077 1.227 1.273
458.sjeng 1.003 0.354 1.040 0.031 1.318 1.373
462.libquantum 1.040 1.856 1.051 0.029 1.257 1.428
464.h264ref 1.068 0.620 1.088 0.782 1.389 1.312
471.omnetpp 1.004 1.107 0.999 1.091 1.146 1.198
433.milc 1.021 0.566 0.999 0.486 1.297 1.276
444.namd 0.992 0.530 1.015 0.016 1.002 1.018
453.povray 0.997 0.416 1.035 0.022 1.237 1.418
470.lbm 1.020 0.025 1.004 0.005 1.025 1.031
482.sphinx3 0.993 0.676 0.992 0.070 1.167 1.225
Geomean 1.018 0.434% 1.031 0.086% 1.178 1.235
(a) SPEC CPU2006 (w/ ref dataset) Performance Improvement
Benchmark | Speedup wrt O3 | Measured Variance | Speedup wrt MLGO | Measured Variance | Code size increase wrt O3 | Code size increase wrt MLGO
automotive_bitcount 1.002 0.18% 1.005 0.17% 1.000 1.000
automotive_qsort1 1.000 0.14% 1.003 0.12% 1.000 1.000
automotive_susan_c 1.020 0.32% 1.122 1.02% 1.349 1.567
automotive_susan_e 1.015 0.28% 1.044 1.71% 1.349 1.567
automotive_susan_s 1.002 0.29% 1.008 0.23% 1.349 1.567
bzip2d 1.025 0.70% 1.012 0.70% 1.173 1.239
bzip2e 1.014 0.53% 1.041 0.53% 1.113 1.189
consumer_jpeg_c 1.002 0.41% 1.005 0.37% 1.045 1.200
consumer_jpeg_d 1.041 0.22% 1.335 1.36% 1.037 1.198
consumer_lame 1.006 0.13% 1.007 0.10% 1.181 1.137
consumer_tiff2bw 1.027 0.30% 1.023 0.32% 1.298 1.356
consumer_tiff2rgba 1.051 0.38% 1.110 1.71% 1.294 1.351
consumer_tiffdither 1.006 0.20% 1.014 0.11% 1.190 1.146
consumer_tiffmedian 1.012 0.12% 1.056 0.16% 1.048 1.048
network_dijkstra 1.005 1.20% 0.991 1.30% 1.000 1.000
network_patricia 1.000 0.35% 0.999 0.29% 1.007 1.052
office_stringsearch1 0.996 0.50% 1.000 0.91% 1.000 1.000
security_blowfish_d 1.005 0.75% 1.002 1.73% 1.000 1.000
security_blowfish_e 1.059 0.13% 1.029 1.80% 1.412 1.390
security_rijndael_d 1.042 0.64% 1.039 0.21% 1.412 1.390
security_rijndael_e 1.037 0.07% 1.035 0.13% 1.009 1.009
security_sha 1.134 1.63% 1.132 0.09% 1.009 1.009
telecom_adpcm_c 1.001 0.13% 1.001 0.21% 0.996 0.996
telecom_adpcm_d 1.001 0.15% 1.000 0.19% 1.000 1.000
telecom_CRC32 1.065 0.44% 1.066 0.57% 1.000 1.000
Geomean 1.022 0.3% 1.041 0.4% 1.121 1.161
(b) Cbench Performance Improvement

4.6. Performance Enablement with Autotuning

Table 6. MLGOPerf Enhanced Autotuning Code Region Opportunities
Cbench | O3: Autotuning Speedup | O3: Tunable Regions | MLGO: Autotuning Speedup | MLGO: Tunable Regions | MLGOPerf: Autotuning Speedup | MLGOPerf: Tunable Regions | Tunable Regions wrt O3 | Tunable Regions wrt MLGO
automotive_bitcount 1.027714 20 1.019832 20 1.02781 20 1.000 1.000
automotive_qsort1 1.009412 32 1.008607 32 1.009123 32 1.000 1.000
automotive_susan_c 1.038951 116 1.036704 112 1.037121 188 1.621 1.679
automotive_susan_e 1.031977 116 1.026087 112 1.033349 188 1.621 1.679
automotive_susan_s 1.001988 116 1.065891 112 1.146078 188 1.621 1.679
bzip2d 1.15753 637 1.100431 580 1.165478 747 1.173 1.288
bzip2e 1.032093 637 1.026258 580 1.033333 747 1.173 1.288
consumer_jpeg_c 1.040332 1049 1.017417 891 1.043764 1204 1.148 1.351
consumer_jpeg_d 1.031342 1074 1.014804 885 1.017778 1148 1.069 1.297
consumer_tiff2bw 1.004812 641 1.018229 619 1.017452 903 1.409 1.459
consumer_tiff2rgba 1.047902 633 1.122697 611 1.123338 907 1.433 1.484
consumer_tiffdither 1.012297 640 1.004719 614 1.007853 902 1.409 1.469
consumer_tiffmedian 0.973255 741 1.001938 715 0.988845 1069 1.443 1.495
network_dijkstra 1.078947 13 1.087719 13 1.061404 22 1.692 1.692
network_patricia 1.015152 12 1.015152 12 1.008772 12 1.000 1.000
office_rsynth 0.998958 152 1.001032 147 1.018398 153 1.007 1.041
security_blowfish_d 1.001764 18 1.000441 18 0.998679 18 1.000 1.000
security_blowfish_e 1.001314 18 1.002632 18 1.001314 18 1.000 1.000
security_pgp_d 1.019659 955 1.017919 929 1.036332 1317 1.379 1.418
security_pgp_e 1.039591 955 1.038804 929 1.04023 1317 1.379 1.418
security_rijndael_d 1.040965 22 1.048175 22 1.04694 25 1.136 1.136
security_rijndael_e 1.018811 22 1.02481 22 1.029064 25 1.136 1.136
security_sha 1.009674 10 1.004434 10 1.101781 13 1.300 1.300
telecom_adpcm_c 1.006329 7 1.004211 7 1.002101 7 1.000 1.000
telecom_adpcm_d 1.039636 7 1.039636 7 1.038986 7 1.000 1.000
telecom_CRC32 1.003663 4 1.001217 4 1.003663 5 1.250 1.250
telecom_gsm 1.010018 115 1.009991 112 1.007286 118 1.026 1.054
Geomean 1.025198 1.027676 1.037887 1.218 1.260

As discussed earlier, an added benefit of utilizing an optimized function inlining pass is the enablement of tuning opportunities for subsequent passes, which may translate into increased performance. In this section, we experimentally explore this phenomenon by designing an autotuning methodology to detect whether MLGOPerf provides increased code-region opportunities in the code segments of interest and whether this leads to finding more optimal configurations. We leverage our autotuner to explore code-region opportunities for loop-unroll and loop-vectorize by tuning unroll-count ∈ {0, 2, 4, 8} and interleave-count ∈ {1, 2, 4}.

Iteratively, our autotuner uses the list of opportunities to determine optimal configurations using different search techniques. The compiler then uses the configurations suggested by the autotuner and generates a new executable. We run the emitted executable and provide the execution times to the autotuner as feedback, which it leverages to generate a new set of configurations. We repeat this tuning process until the stop criteria are satisfied. For practicality, we set the iteration count to 120 per benchmark and showcase the results for Cbench. Similar to the other experimental results in this work, we use a trimmed mean of runs per benchmark at each iteration. The geometric mean of the measured variance between the runs is recorded at 0.164%.
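The tuning loop over the exposed code regions can be sketched as follows; the random search and the synthetic runtime model are stand-ins for the OpenTuner-derived search and the actual compile-and-run step.

    import random

    random.seed(0)
    UNROLL_COUNTS = [0, 2, 4, 8]
    INTERLEAVE_COUNTS = [1, 2, 4]
    NUM_REGIONS = 5      # tunable code regions exposed after inlining (illustrative)
    BUDGET = 120         # tuning iterations per benchmark, as in Section 4.6

    def measure_runtime(config):
        # Synthetic stand-in for "compile with the suggested per-region parameters,
        # run, and report the execution time back to the autotuner".
        preferred = [(4, 2), (2, 1), (8, 4), (0, 1), (4, 4)]
        return 10.0 + sum(0.0 if config[i] == preferred[i] else 0.1
                          for i in range(NUM_REGIONS))

    baseline = measure_runtime([(0, 1)] * NUM_REGIONS)   # untuned configuration
    best_cfg, best_time = None, float("inf")
    for _ in range(BUDGET):
        cfg = [(random.choice(UNROLL_COUNTS), random.choice(INTERLEAVE_COUNTS))
               for _ in range(NUM_REGIONS)]
        runtime = measure_runtime(cfg)
        if runtime < best_time:
            best_cfg, best_time = cfg, runtime

    print(f"best config: {best_cfg}, speedup over untuned: {baseline / best_time:.3f}")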

Table 6 presents this evaluation. We report O3, MLGO, and our proposed work in a scenario where we tune the available parameters and record the number of code-region opportunities the autotuner finds suitable to tune. The Autotuning Speedup columns represent the best configuration found at the end of tuning given our budget. The speedup values are all normalized to LLVM’s O3, and they reveal the following. (1) MLGOPerf enables enhanced autotuning opportunities for subsequent passes: as reported in the final two columns, MLGOPerf provides up to 21.8% and 26% more tunable code regions than LLVM’s O3 and MLGO, respectively. (2) Understandably, not all of these added opportunities translate into higher performance. As the autotuner explores the optimization space, many of the combinations do not lead to a better outcome; however, we experimentally observe that with MLGOPerf, the rate at which these opportunities translate into higher speedups is greater. The resulting autotuned speedups are 2.5%, 2.7%, and 3.7% for LLVM’s O3, MLGO, and MLGOPerf, respectively.

5. Discussion

MLGOPerf is a first step towards ML-guided compiler optimizations targeting performance, and although it provides benefits to function inlining and its subsequent optimization passes, it is still in its infancy, and we remain distant from an ideal standalone compiler pass that derives optimal decisions using ML. Function inlining is a unique optimization that can easily slow down an executable; both no inlining and aggressive inlining can hurt the performance of running code, and there is a fine line to walk between the two extremes. There are a number of challenges and shortcomings we would like to mention here.

5.1. Challenges

Function vs. Global speedup

MLGOPerf uses the predicted function speedup values of IR2Perf as a proxy to guide the training of its RL agent towards an optimal function inlining that benefits the global speedup of the program. However, we experimentally notice that in 15% of the cases in our training data, these two metrics contradict one another, in the sense that an increase in function speedup led to a decrease in global speedup. This is a common challenge of optimizing applications using a finer-grained performance metric, which we alluded to in Section 3.2.2 regarding performance decoupling. There are several factors involved in this phenomenon, e.g., the cache hierarchy, memory-bound workloads, and hardware microarchitecture design, and these are outside the scope of this work (Theodoridis et al., 2022).

Multi-objective optimization

MLGOPerf attempts to optimize the performance of a code segment with intelligent inlining decisions derived by ML. However, in a number of cases we experimentally observe that the emitted code size also increases. Although this is outside the scope of the current work, an enhanced function inlining may also benefit from taking both objectives into account in its exploration strategy by identifying Pareto optimality (Silva et al., 2021). Similar strategies are employed for a number of adjacent problems, e.g., multicore embedded systems (Ascia et al., 2005; Silvano et al., 2011) and compiler autotuning (Hoste and Eeckhout, 2008b; Ashouri et al., 2013; Moreau et al., 2018).

Compiler optimization space

It is an inherently difficult problem to identify the optimal set of optimization passes for a given code segment. There is a plethora of permutations of optimizations that can be applied to increase the performance of running code, and the problem quickly becomes NP-hard (Ashouri et al., 2018a). Additionally, the optimization space is unbounded, as there is no bound on the length of the optimization sequence one can add to the optimization pipeline. Similar to the Halting problem (Burkholder, 1987), a general algorithm to solve the selection and phase-ordering problems for all possible program-input pairs cannot exist.

Compiler performance prediction

Another fundamentally difficult problem with MLGOPerf, and any other performance prediction framework, is that predicted CPU time/cycles are architecture and compiler dependent. There are many unknown factors that affect the performance of a binary, and at the same time we always, at best, measure a noisy estimate of the performance metrics; for these reasons, predicting performance or speedup remains a difficult problem.

Software characterization

MLGOPerf leverages our handcrafted features to characterize a segment of code. However, this process introduces an unwanted discretization noise, which could be reduced by means of finer-grained embeddings, a topic outside the scope of this work. We acknowledge that one of the toughest challenges in the compiler space is the lack of an ideal characterization of software in a way that ML models can exploit, and we are no exception here.

6. Conclusion

In this paper, we presented MLGOPerf, the first end-to-end framework capable of optimizing performance using the ML-Inliner. We experimentally demonstrated that MLGOPerf achieves up to 1.8% and 2.2% performance speedup over LLVM’s O3 on SPEC CPU2006 and Cbench, while expanding the code-region opportunities for subsequent passes by up to 26%, which can add a further 3.7% speedup to the emitted binary at the end of the O3 pipeline. Future work will focus on generalizing a flow in which other compiler optimizations can be tackled in a seamless manner.

References

  • Agakov et al. (2006) Felix Agakov, Edwin Bonilla, John Cavazos, Björn Franke, Grigori Fursin, Michael FP O’Boyle, John Thomson, Marc Toussaint, and Christopher KI Williams. 2006. Using machine learning to focus iterative optimization. In Proceedings of the International Symposium on Code Generation and Optimization. IEEE Computer Society, 295–305.
  • Alon et al. (2019) Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages 3, POPL (2019), 1–29.
  • Ansel et al. (2014) Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O’Reilly, and Saman Amarasinghe. 2014. Opentuner: An extensible framework for program autotuning. In Proceedings of the 23rd international conference on Parallel architectures and compilation. 303–316.
  • Ascia et al. (2005) G. Ascia, V. Catania, M. Palesi, and D. Patti. 2005. A system-level framework for evaluating area/performance/power trade-offs of VLIW-based embedded systems. In Design Automation Conference, 2005. Proceedings of the ASP-DAC 2005. Asia and South Pacific, Vol. 2. 940–943.
  • Ashouri et al. (2017) Amir H. Ashouri, Andrea Bignoli, Gianluca Palermo, Cristina Silvano, Sameer Kulkarni, and John Cavazos. 2017. MiCOMP: Mitigating the Compiler Phase-Ordering Problem Using Optimization Sub-Sequences and Machine Learning. ACM Trans. Archit. Code Optim. 14, 3, Article 29 (Sept. 2017), 28 pages. https://doi.org/10.1145/3124452
  • Ashouri et al. (2018a) Amir H. Ashouri, William Killian, John Cavazos, Gianluca Palermo, and Cristina Silvano. 2018a. A survey on compiler autotuning using machine learning. ACM Computing Surveys (CSUR) 51, 5 (2018), 1–42. https://doi.org/10.1145/3197978
  • Ashouri et al. (2016) Amir Hossein Ashouri, Giovanni Mariani, Gianluca Palermo, Eunjung Park, John Cavazos, and Cristina Silvano. 2016. COBAYN: Compiler Autotuning Framework Using Bayesian Networks. ACM Trans. Archit. Code Optim. (TACO) 13, 2, Article 21 (June 2016), 25 pages. https://doi.org/10.1145/2928270
  • Ashouri et al. (2018b) Amir H. Ashouri, Gianluca Palermo, John Cavazos, and Cristina Silvano. 2018b. Automatic Tuning of Compilers Using Machine Learning. Springer. https://doi.org/10.1007/978-3-319-71489-9
  • Ashouri et al. (2013) Amir Hossein Ashouri, Vittorio Zaccaria, Sotirios Xydis, Gianluca Palermo, and Cristina Silvano. 2013. A framework for Compiler Level statistical analysis over customized VLIW architecture. In VLSI-SoC. 124–129. https://doi.org/10.1109/VLSI-SoC.2013.6673262
  • Ben-Nun et al. (2018) Tal Ben-Nun, Alice Shoshana Jakobovits, and Torsten Hoefler. 2018. Neural code comprehension: A learnable representation of code semantics. Advances in Neural Information Processing Systems 31 (2018).
  • Bodin et al. (1998) François Bodin, Toru Kisuki, Peter Knijnenburg, Mike O’Boyle, and Erven Rohou. 1998. Iterative compilation in a non-linear optimisation space. In Workshop on Profile and Feedback-Directed Compilation.
  • Burkholder (1987) Leslie Burkholder. 1987. The halting problem. ACM SIGACT News 18, 3 (1987), 48–60.
  • Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. 785–794.
  • Chen et al. (2012) Yang Chen, Shuangde Fang, Yuanjie Huang, Lieven Eeckhout, Grigori Fursin, Olivier Temam, and Chengyong Wu. 2012. Deconstructing iterative optimization. ACM Transactions on Architecture and Code Optimization (TACO) 9, 3 (2012), 21.
  • Cooper et al. (1999) KD Cooper, PJ Schielke, and D Subramanian. 1999. Optimizing for reduced code space using genetic algorithms. ACM SIGPLAN Notices (1999). http://dl.acm.org/citation.cfm?id=314414
  • Cummins et al. (2021a) Chris Cummins, Zacharias V Fisches, Tal Ben-Nun, Torsten Hoefler, Michael FP O’Boyle, and Hugh Leather. 2021a. Programl: A graph-based program representation for data flow analysis and compiler optimizations. In International Conference on Machine Learning. PMLR, 2244–2253.
  • Cummins et al. (2017) C. Cummins, P. Petoumenos, Z. Wang, and H. Leather. 2017. End-to-End Deep Learning of Optimization Heuristics. In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). 219–232. https://doi.org/10.1109/PACT.2017.24
  • Cummins et al. (2021b) Chris Cummins, Bram Wasti, Jiadong Guo, Brandon Cui, Jason Ansel, Sahir Gomez, Somya Jain, Jia Liu, Olivier Teytaud, Benoit Steiner, et al. 2021b. CompilerGym: Robust, Performant Compiler Optimization Environments for AI Research. arXiv preprint arXiv:2109.08267 (2021).
  • Das et al. (2020) Dibyendu Das, Shahid Asghar Ahmad, and Venkataramanan Kumar. 2020. Deep Learning-based Approximate Graph-Coloring Algorithm for Register Allocation. In 2020 IEEE/ACM 6th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC) and Workshop on Hierarchical Parallelism for Exascale Computing (HiPar). IEEE, 23–32.
  • Deco and Obradovic (1996) Gustavo Deco and Dragan Obradovic. 1996. An information-theoretic approach to neural computing. Springer Science & Business Media.
  • Fursin et al. (2011) G Fursin, Y Kashnikov, and AW Memon. 2011. Milepost gcc: Machine learning enabled self-tuning compiler. International journal of parallel programming 39, 3 (2011), 296–327. http://link.springer.com/article/10.1007/s10766-010-0161-2
  • Fursin et al. (2008) G Fursin, C Miranda, and O Temam. 2008. MILEPOST GCC: machine learning based research compiler. GCC Summit (2008). https://hal.inria.fr/inria-00294704/
  • Fursin and Temam (2009) Grigori Fursin and Olivier Temam. 2009. Collective optimization. In International Conference on High-Performance Embedded Architectures and Compilers. Springer, 34–49.
  • Georgiou et al. (2018) Kyriakos Georgiou, Craig Blackmore, Samuel Xavier-de Souza, and Kerstin Eder. 2018. Less is More: Exploiting the Standard Compiler Optimization Levels for Better Performance and Energy Consumption. arXiv preprint arXiv:1802.09845 (2018).
  • Haj-Ali et al. (2020) Ameer Haj-Ali, Nesreen K Ahmed, Ted Willke, Yakun Sophia Shao, Krste Asanovic, and Ion Stoica. 2020. Neurovectorizer: End-to-end vectorization with deep reinforcement learning. In Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization. 242–255.
  • Hall et al. (2009) M Hall, D Padua, and K Pingali. 2009. Compiler research: the next 50 years. Commun. ACM (2009). http://dl.acm.org/citation.cfm?id=1461946
  • Hoste and Eeckhout (2008a) K Hoste and L Eeckhout. 2008a. Cole: compiler optimization level exploration. In Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization. 165–174. http://dl.acm.org/citation.cfm?id=1356080
  • Hoste and Eeckhout (2008b) Kenneth Hoste and Lieven Eeckhout. 2008b. Cole: compiler optimization level exploration. In Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization. 165–174.
  • James et al. (2013) Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An introduction to statistical learning. Vol. 112. Springer.
  • Kalyan et al. (2019) Michael Kalyan, Xiang Wang, Ahmed Eltantawy, and Yaoqing Gao. 2019. Code Region Based Auto-Tuning Enabled Compilers. In Workshop on the Intersection of High Performance Computing and Machine Learning (HPCaML). ACM.
  • Konyushkova et al. (2020) Ksenia Konyushkova, Konrad Zolna, Yusuf Aytar, Alexander Novikov, Scott Reed, Serkan Cabi, and Nando de Freitas. 2020. Semi-supervised reward learning for offline reinforcement learning. arXiv preprint arXiv:2012.06899 (2020).
  • Koseki (1997) A Koseki. 1997. A method for estimating optimal unrolling times for nested loops. In Parallel Architectures, Algorithms, and Networks, 1997. (I-SPAN’97) Proceedings., Third International Symposium on. 376–382. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=645123
  • Kulkarni and Cavazos (2012) S Kulkarni and J Cavazos. 2012. Mitigating the compiler optimization phase-ordering problem using machine learning. ACM SIGPLAN Notices (2012). http://dl.acm.org/citation.cfm?id=2384628
  • Lattner (2008) Chris Lattner. 2008. LLVM and Clang: Next generation compiler technology. In The BSD conference, Vol. 5.
  • Machado et al. (2018) Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. 2018. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research 61 (2018), 523–562.
  • Mannarswamy and Das (2022) Sandya Mannarswamy and Dibyendu Das. 2022. Learning to Combine Instructions in LLVM Compiler. arXiv preprint arXiv:2202.12379 (2022).
  • Moreau et al. (2018) Thierry Moreau, Anton Lokhmotov, and Grigori Fursin. 2018. Introducing ReQuEST: an Open Platform for Reproducible and Quality-Efficient Systems-ML Tournaments. CoRR abs/1801.06378 (2018). arXiv:1801.06378 http://arxiv.org/abs/1801.06378
  • Moss et al. (1998) Eliot Moss, Paul Utgoff, John Cavazos, Doina Precup, D Stefanovic, Carla Brodley, and David Scheeff. 1998. Learning to schedule straight-line code. Advances in Neural Information Processing Systems 10 (1998), 929–935. http://books.nips.cc/papers/files/nips10/0929.pdf
  • Nobre et al. (2018) Ricardo Nobre, Luís Reis, and João M. P. Cardoso. 2018. Impact of Compiler Phase Ordering When Targeting GPUs. In Euro-Par 2017: Parallel Processing Workshops, Dora B. Heras and Luc Bougé (Eds.). Springer International Publishing, Cham, 427–438.
  • Park et al. (2012) E Park, J Cavazos, and MA Alvarez. 2012. Using graph-based program characterization for predictive modeling. Proceedings of the International Symposium on Code Generation and Optimization (2012), 295–305. http://dl.acm.org/citation.cfm?id=2259042
  • Park et al. (2013) E Park, J Cavazos, and LN Pouchet. 2013. Predictive modeling in a polyhedral optimization space. International journal of parallel programming (2013), 704–750. http://link.springer.com/article/10.1007/s10766-013-0241-1
  • Park et al. (2022) Sunghyun Park, Salar Latifi, Yongjun Park, Armand Behroozi, Byungsoo Jeon, and Scott Mahlke. 2022. SRTuner: Effective Compiler Optimization Customization by Exposing Synergistic Relations. In 2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 118–130.
  • Phothilimthana et al. (2021) Phitchaya Mangpo Phothilimthana, Amit Sabne, Nikhil Sarda, Karthik Srinivasa Murthy, Yanqi Zhou, Christof Angermueller, Mike Burrows, Sudip Roy, Ketan Mandke, Rezsa Farahani, et al. 2021. A Flexible Approach to Autotuning Multi-Pass Machine Learning Compilers. In 2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 1–16.
  • Prokopec et al. (2019) Aleksandar Prokopec, Gilles Duboscq, David Leopoldseder, and Thomas Würthinger. 2019. An optimization-driven incremental inline substitution algorithm for just-in-time compilers. In 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 164–179.
  • Rotem and Cummins (2021) Nadav Rotem and Chris Cummins. 2021. Profile Guided Optimization without Profiles: A Machine Learning Approach. arXiv preprint arXiv:2112.14679 (2021).
  • Scheifler (1977) Robert W Scheifler. 1977. An analysis of inline substitution for a structured programming language. Commun. ACM 20, 9 (1977), 647–654.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).
  • Shelhamer et al. (2016) Evan Shelhamer, Parsa Mahmoudieh, Max Argus, and Trevor Darrell. 2016. Loss is its own reward: Self-supervision for reinforcement learning. arXiv preprint arXiv:1612.07307 (2016).
  • Silva et al. (2021) Anderson Faustino da Silva, Bernardo NB de Lima, and Fernando Magno Quintão Pereira. 2021. Exploring the space of optimization sequences for code-size reduction: insights and tools. In Proceedings of the 30th ACM SIGPLAN International Conference on Compiler Construction. 47–58.
  • Silvano et al. (2011) Cristina Silvano, William Fornaciari, Gianluca Palermo, Vittorio Zaccaria, Fabrizio Castro, Marcos Martinez, Sara Bocchio, Roberto Zafalon, Prabhat Avasare, Geert Vanmeerbeeck, et al. 2011. Multicube: Multi-objective design space exploration of multi-core architectures. In VLSI 2010 Annual Symposium. Springer, 47–63.
  • Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.
  • Theodoridis et al. (2022) Theodoros Theodoridis, Tobias Grosser, and Zhendong Su. 2022. Understanding and exploiting optimal function inlining. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 977–989.
  • Tiwari et al. (2009) A Tiwari, C Chen, and J Chame. 2009. A scalable auto-tuning framework for compiler optimization. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on. 1–12. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5161054
  • Trofin et al. (2021) Mircea Trofin, Yundi Qian, Eugene Brevdo, Zinan Lin, Krzysztof Choromanski, and David Li. 2021. Mlgo: a machine learning guided compiler optimizations framework. arXiv preprint arXiv:2101.04808 (2021).
  • Wang et al. (2022) Huanting Wang, Zhanyong Tang, Cheng Zhang, Jiaqi Zhao, Chris Cummins, Hugh Leather, and Zheng Wang. 2022. Automating reinforcement learning architecture design for code optimization. In Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction. 129–143.
  • Young et al. (2018) Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2018. Recent trends in deep learning based natural language processing. ieee Computational intelligenCe magazine 13, 3 (2018), 55–75.