\publyear

22 \papernumber2102 backgroundcolor=, basicstyle=, breakatwhitespace=false, breaklines=true, captionpos=b, commentstyle=, deletekeywords=..., escapeinside=%**), extendedchars=true, keepspaces=true, keywordstyle=, language=C, morekeywords=__kernel, __local, __global, barrier, rulecolor=, showspaces=false, showstringspaces=false, showtabs=false, stepnumber=1, stringstyle=, tabsize=1, backgroundcolor=, basicstyle=, breakatwhitespace=false, breaklines=true, captionpos=b, commentstyle=, deletekeywords=..., escapeinside=%**), extendedchars=true, keepspaces=true, keywordstyle=, language=C, morekeywords=chan, byte, od, fi, atomic, proctype, select, run, active, inline, rulecolor=, showspaces=false, showstringspaces=false, showtabs=false, stepnumber=1, stringstyle=, tabsize=1,

Auto-Tuning High-Performance Programs
Using Model Checking in Promela

Natalia Garanina
On visit at the University of Muenster Natalia Garanina was supported by the scholarship of DAAD (German Academic Exchange Service). Germany Sergey Staroletov Sergei Gorlatch
University of Muenster Sergei Gorlatch is supported by the DFG project Nr. 470527619 (PPP-DL) at the University of Munster. Muenster Germany

Abstract

The paper combines research approaches that traditionally have been disjoint: 1) model checking as used in formal verification of programs, and 2) auto-tuning as often used in high-performance computing. Auto-tuning frameworks optimize parallel programs by finding the optimal values of the performance-critical parameters — so-called tuning parameters — for a particular high-performance architecture and input data size. As there are many parameters influencing program’s performance, finding the optimal parameter configuration is a hardly manageable task even for experts. Auto-tuning automates this process, but it is often time-consuming. We apply model checking for accelerating auto-tuning by using a counterexample constructed during the verification of the optimality property of the program. We describe in detail an implementation of our approach for programs written in OpenCL – the standard for programming modern high-performance architectures – using the model representation language Promela and the popular SPIN verification tool, and we report experimental results for an application use case.

keywords:

model checking, temporal logics, counterexamples, high-performance computing, auto-tuning, SPIN, Promela

^†^†volume: 185^†^†issue: 1

Auto-Tuning Using Model Checking

1 Motivation and Related Work

We aim at bringing together two research areas that have traditionally been disjoint:1) model checking used in formal verification of programs, and 2) auto-tuning used in the optimization of programs for modern high-performance architectures that comprise multi-core universal processors (CPU) and many-core Graphics Processing Units (GPU).

Auto-tuning [1] is an important case of the general concept of automated algorithm configuration and parameter tuning. It has become especially popular recently in developing parallel programs: it finds the values of the performance-critical program parameters that ensure optimal performance of the program code on a particular target architecture and problem data size. Typical examples of such tuning parameters for parallel programs in the OpenCL language are: number of threads, tile size, workgroup size, shared-memory block size, etc. The potentially large number of tuning parameters and their complicated interplay in influencing program performance makes it very difficult even for an expert to find a (near to) optimal combination of parameter values that ensures good performance on a particular machine and for a particular size of input data.

For the implementation of auto-tuning, an auto-tuning framework (auto-tuner) is used: the framework automatically generates a search space of parameter configurations and then examines this space by running the program for chosen configurations on a real hardware of the target high-performance system. The currently popular auto-tuners include OpenTuner [2], PATUS [3], ATLAS [4], FFTW [5], CLTune [6], ATF [7], to name a few. They use a variety of techniques to find the optimal solution, including simulated annealing, random search, genetic algorithms, machine learning, and dynamic programming. Only few of auto-tuners nowadays offer exhaustive search, which guarantees to find the optimal result but implies extremely high time costs for checking all possible configurations: the search for a solution of high quality may take several hours or even days on a high-performance system.

In this paper, we attempt to improve auto-tuning by using modern model checking tools. The goal of our approach is to automatically find good tuning parameter values faster than existing auto-tuners, and without using real hardware. Model checking is widely used for formal program verification: it checks the satisfiability of a particular program property for all scenarios of a program model behavior. Both the model and the property are specified using formal languages. For example, a program model can be represented as a Kripke structure, and a property as an LTL temporal logic formula. If the property is not satisfied then the model checking method can find a counterexample, i.e., conditions and sequence of actions of the model that imply that the property is not satisfied.

Our approach to auto-tuning based on model checking relies on exhaustive or swarm search by using a counterexample constructed during the verification of the optimality property for a program which is to be auto-tuned. We have been inspired by previous research on using counterexamples for solving optimization problems and, in particular, compiling optimal schedules for reactive systems [8, 9, 10, 11, 12]. A special edition of Dagstuhl seminar [13] was devoted to the interrelation of automatic planning/optimization and model checking methods. The recent progress in model checking has achieved that modern tools examine each element of the search space using various shortcuts, such as symbolic encoding, partial order reduction, and abstraction [14]. Due to these shortcuts, exhaustive exploration of larger search spaces can be done in a shorter time.

An important expected benefit of using model checking for auto-tuning is that the target high-performance architecture does not have to be available when searching for the optimal parameter values of a parallel program designed for this architecture. We are not aware of previous approaches to use model checking for auto-tuning programs designed for high-performance systems with modern multi- and many-core processors. Malik and Pena [15] use a custom model checker to compute optimal parameter configuration for multi-variable systems in the automotive industry. This task can be considered similar to auto-tuning, but it does not use the counterexample method that we apply in the present paper.

In this paper, the target area for auto-tuning are parallel programs written in OpenCL [16] which is an open standard approach to programming various modern architectures: multi-core CPU, many-core GPU, FPGA, and others. In order to ensure generality and flexibility of our approach, we assume the abstract platform model of OpenCL as our target architecture, rather than a particular kind of architecture. As our model checking tool, we use the SPIN verifier [17] with its modeling language Promela [18] whose semantics combines the CSP (Communicating Sequential Processes) [19] parallelism and the actor model [20].

We initially proposed our idea of using model checking for auto-tuning in the short, Russian-language paper [21] where we model the Nvidia’s Fermi architecture as an abstract multiprocessor with a warp scheduling for processing elements. The conference paper [22] refines our initial ideas and extends them by developing a first proof-of-concept implementation of auto-tuning for OpenCL programs using the SPIN model checker and conducting first experiments with it. This paper is a substantially rewritten and extended version of [22], with the following major additions: 1) we describe particular details of our approach, in particular we formally define the Promela processes for devices, units and the service clock, and 2) we significantly extend the scope of our experiments in that we describe a particular application of our approach to auto-tuning an OpenCL program for a representative example of computing problems based on reduction, namely, for computing the minimum in integer array.

The structure of this paper and its main contributions are as follows. Section 2 presents the idea of our approach as an adaptation of model checking and the technique of counterexamples to auto-tuning high-performance programs. Section 3 describes the standard platform model of OpenCL and how our abstract model of the OpenCL platform can be represented in the Promela language. Section 4 presents the use of the SPIN model checking tool for auto-tuning, i.e., finding the optimal parameter configuration of an OpenCL program. Section 5 discusses how we can adapt our general auto-tuning approach for the case of limited computation resources. Section 6 reports our extensive experiments with particular OpenCL programs and their results. In Section 7, we apply our model checking based approach to auto-tuning for a program that computes the minimal value in an integer array in parallel, and we compare results of our approach with the real execution of the program. Section 8 concludes our presentation, summarizes our findings, and provides an outline for future research.

2 Model Checking for Auto-Tuning: The Idea of Our Approach

Our main idea in adapting model checking to auto-tuning high-performance programs is based on using the general counterexample-guided approach as described, e.g., in [8]. We target multi-threaded programs that transform some input data to output data in a highly parallel manner. The task of auto-tuning such parallel programs is to find the values of performance-critical program parameters (we call them tuning parameters) that enable obtaining the result of the program in the minimum possible time.

We start outlining our approach by formulating a general optimality property as the over-time property $\Phi_{o}$ for the impossibility of program termination within a particular time interval as follows: ‘‘A parallel program cannot terminate within $T$ units of time’’, i.e., the program requires a run time that is longer than $T$ . To formalize this property, we can use a temporal logic, such that the over-time property can be formally verified in the model that represents the execution of the parallel program on a particular target architecture, using some model checking tool. If this verification shows that the property is not satisfied, this means that the computation can indeed terminate within the model time $T$ . In this case, the model checker constructs a counterexample that describes the conditions under which the program terminates within a model time not greater than $T$ , and also the corresponding configuration of tuning parameters that are responsible for the achieved program’s performance. Next, we can gradually approximate the given termination model time $T$ and check the over-time property again and again, until we reach a minimal model time $T_{min}$ of the program: if we decrease this time by only one unit of time, the model checker will prove that the $\Phi_{o}$ property is satisfied, i.e., the program really cannot terminate within $T_{min}-1$ units of time. This way, we find both the minimal model run time of the parallel program and the corresponding values of tuning parameters to achieve this run time; in other words, we have auto-tuned our program.

Refer to caption — Figure 1: The bisection method for finding a minimal termination time

For implementing this idea of auto-tuning using model checking, we should choose a suitable initial time value $T_{ini}$ and determine a procedure for moving from $T_{ini}$ towards the minimal model run time $T_{min}$ . To find a minimal termination model time, we may use, for example, the bisection method shown in Fig. 1: predicate $C_{ex}(T)$ is true iff a model checker generates a counterexample for the model and $\Phi_{o}$ property with time $T$ . The initial value of the termination model time $T_{ini}$ can be specified by simulating the program model: usually model checking tools allow for simulating model execution scenarios, such that it is possible to reveal the value of parallel program execution time in any of the scenarios.

Summarizing, our proposed method of using model checking with counterexamples for auto-tuning a parallel program proceeds in four steps, as follows:

1.

Represent the parallel program with its tuning parameters and target architecture in the language (in our case Promela) of a model checking tool (in our case - SPIN).
2.

Formulate the over-time property $\Phi_{o}$ for the inability to terminate the program execution within time $T$ in the language of a model checking tool.
3.

Search for the minimal program termination time $T_{min}$ starting from an initial time $T_{ini}$ provided by a simulation of the model constructed in Step 1.
4.

Extract information about the optimal configuration of tuning parameters from the counterexample found for the minimal time $T_{min}$ .

3 Our Modeling Approach: From OpenCL to Promela

In this section, we describe how our auto-tuning method outlined in the previous section can be implemented for our particular use case – auto-tuning of OpenCL programs using the formal modeling language Promela and the verification tool SPIN for this language. We target OpenCL, because it covers programming for practically all currently used and emerging parallel architectures, including: universal multi-core processors (CPU), many-core Graphics Processing Units (GPU), FPGA, etc.

3.1 An Abstract OpenCL Platform Model

Figure 2 shows the components of our architecture Abstract Platform based on the so-called platform model as defined in the OpenCL standard [16].

The abstract model in Fig. 2 corresponds to the currently most widespread use case of OpenCL: compute devices represent (possibly multiple) GPUs that are connected to the host CPU. The host is responsible for loading data to and from the global memory of devices. Every device includes compute units to process the data, while each unit contains processing elements. For example, the newest Ampere A100 architecture of Nvidia Corp. comprises in one device (GPU) altogether 108 compute units (called streaming multiprocessors in the Nvidia terminology) with a total number of 6192 processing elements (so-called CUDA cores).

For simplicity of modeling, we assume in our model that every abstract compute unit includes $2^{n}$ abstract processing elements that are universal calculators. A compute unit has fast local memory, to which all its processing elements have access. Data from global memory of a device can be loaded there. The difference of the access speed to local and global memory is usually between one and two orders of magnitude, depending on a particular processor.

The Promela language allows us to model all components of our abstract platform model – devices, units, processing elements, and the host process that distributes computations among devices – using different Promela processes. These processes are hierarchically linked: the host process starts devices while devices activate their units that regulate the work of their processing elements. The processes synchronize their work by using handshake message channels for launch and finish commands, as well as for reporting termination. We model the difference in the speed of access to local and global memory by using constant values that specify the memory access time. We simulate the global time in the system by a process counting the number of active processing elements to determine the moment when to increase the time. We do not take into account the communication time between platform components, which in the practice of GPU programming is significantly lower than other time delays. We describe the details of Promela implementation of the abstract platform in Section 4.

3.2 An Abstract OpenCL Program

Every OpenCL program consists of two logical parts: a host program and a kernel: the host program is executed on a CPU, and the kernel is executed in parallel by all processing elements of the device (e.g., a GPU) connected to the host.

Listing 1 shows the structure of the host program: the host compiles the kernel program at run time and organizes its work with data, in particular reserving the global device memory, copying input data into it, and uploading the result data from this memory into the host memory for subsequent processing. The device (usually GPU) performs parallel data processing by executing the kernel code simultaneously on multiple processing elements.

Listing 1: Host program in OpenCL: abstract structure

⬇

1int main (void) {

2 Initialize OpenCL

3 Compile kernel source code

4 Reserve global memory on device, copy input data to device

5 Execute kernel (instances)

6 Copy output data from the device

7 Analyze and process output data

Usually, an OpenCL program processes input arrays and outputs the results of that processing. Therefore, kernels evaluate expressions in the parallel manner, based on the available data, depending on the array index, and assign the results of the calculation to separate elements of the output array. The involved arrays are often multidimensional, with corresponding multidimensional indices.

Listing 2 shows a typical structure of an OpenCL kernel; each instance of the OpenCL kernel is executed by a so-called work item that has an identifier corresponding to the array index. Thus, work items perform the same calculations for different array elements as it is usual in the concept of data (or SIMD) parallelism [23]. If the computations of an iteration of the loop as in Listing 2 depend on the computations of another iteration, i.e., on the array index, then OpenCL-provided barriers for synchronizing computations are used. When executed on the underlying architecture, several work items are united in a workgroup, either explicitly by the programmer or automatically by a specific OpenCL implementation, such that all work items of a workgroup are executed on the same compute unit of the device.

Listing 2: Typical scheme of an OpenCL kernel program

⬇

1__kernel void abstract_kernel (__global float* N_g, __global float* R_g, int size) {

2 __local float N_l[TS];

3 int idx_g = get_global_id(0);

4 int idx_l = get_local_id(0);

5 float result = 0;

6 for (int i = 0; i < size / TS; ++i) {

7 // access to global memory

8 N_l[idx_l] = f(i, idx_l, size, N_g);

9 // waiting for local co-workers

10 barrier (CLK_LOCAL_MEM_FENCE);

11 if (b(idx_l)) // access to local memory

12 for (int k = 0; k < TS; ++k)

13 result = g1(k, idx_l, N_l, result);

14 else for (int k = 0; k < TS; ++k)

15 result = g2(k, idx_l, N_l, result);

16 // waiting for local co-workers

17 barrier (CLK_LOCAL_MEM_FENCE);

18 }

19 // copy the result of this item to global memory

20 R_g[h(idx_g, size)] = result;

21}

In the OpenCL language (which is syntactically based on C), there are four memory types: global, constant, local, and private. All work items and the host have access to global memory. Constant memory is an immutable part of global memory. Elements of one workgroup have access to local memory. Work items can also use their private memory. Since the access to local memory on the device is significantly faster than access to global memory, the kernel often specifies which chunks of data (so-called tiles) and how are loaded into local memory.

In the following, we illustrate how we can model the execution of OpenCL programs using the model checking tools in general and the Promela language in particular. For that, we define the following abstract representation of an OpenCL kernel program which transforms an input array to resulting output array by performing abstract computation with four abstract functions over integer and floating point variables.

Let us describe how the program in Listing 2 performs data-parallel computations on the elements of arrays. For clarity, the scheme of program operation is also shown in Fig. 3. For each index of an array, a separate work item calculates the local result of the kernel code execution. In our example program in the listing, the input data is array N_g of size size. The output (result) here are the elements of the R_g array. To reduce accesses to global memory, line 2 in the listing declares a local array N_l of size TS whose elements depend on the input data. Each member of a workgroup that obtains the value of its index idx_l in line 4, applies function f to calculate its value in parallel using local array N_l[idx_l] in line 8; this value can be used by all members of the workgroup, so in line 10 every workgroup member waits until the local data of all members of this workgroup are processed. This waiting is implemented by the synchronization operator barrier. Further in lines 12-14, the iterative calculation of the result depends only on the local data and the index of the group element. In addition, depending on the index, these calculations can be performed in different ways: in line 11, boolean function b(idx_l) controls the calculation options (function g1 or g2). In line 17, the work item waits for the completion of the calculations of all co-workers with the current local data, and at the next iteration of the loop in line 6, local data are updated based on the next portion of global data. When input global data are fully processed, the result of the work item is saved in global memory in array element R_g, the index of which depends on the size of the input data and on the idx_g global index received by the work item in line 3.

Listing 2 is a deliberately simplified version of an OpenCL kernel. In general, arrays may be multidimensional; there may be several input arrays, and, in addition, there may be other input variables and constants located in global memory; the output can comprise multiple arrays; there are no changes in global memory that require synchronization of all processes in all workgroups. Of course, the pseudocode in Listing 2 can be easily enriched with these details without going beyond the scope of its functions and operators and, thus, our approach is applicable in the general case.

Our Promela-based approach relies on the fact that Promela, like OpenCL, is close to the C language. Hence, we can translate OpenCL programs into the Promela language taking into account the following restrictions. As the SPIN verification framework assumes an abstraction from the computational aspects of programs and is intended to check the interaction, synchronization and coordination of parallel processes, all data in the Promela model must be of a finite type. Moreover, when modeling the abstract OpenCL kernel in Promela, we abstract from specific calculations. To find the optimal parameter values, we only take into account the time of these calculations, which depends on the number of accesses to the global and local memory and the ratio of them. Therefore, computations in lines 8, 12, 14, and 20 of Listing 2 are replaced in the Promela model of the kernel by the code that implements time ticking required for these computations. For the same reason, we ignore the parameters and local variables of the kernel that specify the content of the data for calculations (lines 1-5). This abstraction can be refined for a particular program and hardware by estimating run time for every function used in the program: we take into account the time of all primitive operations in the function and the ratio of the global memory access time to the local memory access time (these characteristics are usually known from the hardware specifications). However, the amount of these calculations which depends on size must be taken into account too. Hence, the code in Listing 2 is representative enough up to the number of loops and program control structures, because for modeling massive parallel computations, only the size of the input data matters.

An important aspect that we take into account in our Promela-based modeling is the synchronization of workgroup elements relative to changes in local and global memory. To provide local synchronization, our model introduces barrier processes that are responsible for separate workgroups, while for global synchronization we introduce a barrier process for all work items. In our example kernel, there is no global barrier, but its implementation in Promela is quite similar to the implementation of the local barrier. The role of the host program in our Promela modeling is reduced to the process of ‘‘compiling’’ the kernel, i.e., distribution of computations on a given device. We present more details of the Promela implementation of the execution of both OpenCL kernel and its host program in the next section, which also describes the application of model checking to optimize the performance of OpenCL programs.

4 Using SPIN for Auto-Tuning OpenCL Programs

This section describes in detail how we implement the four steps of our counterexample-based approach (Section 2) in order to auto-tune OpenCL programs using the model-checking tool SPIN and its language Promela. Our modeling of OpenCL programs execution is based on the standard OpenCL semantics [16].

The OpenCL semantics prescribe that the OpenCL compiler allocates a host (CPU) connected to a device (CPU, GPU, or FPGA) for the execution of an OpenCL program. We assume that the device integrates $m$ units, and the work items of one workgroup are executed on one unit. Each work item in a workgroup is sequentially executed on one processing element of the unit: altogether $2^{mn}$ work items are executed simultaneously, with one instruction performed in one clock cycle. Since the Abstract OpenCL Platform Model does not care about warps (groups of work items for cooperative scheduling in hardware), our Promela model also does not consider warps. Our method can be customized for a particular hardware with its special warp policy by changing the orchestration of processing elements by their unit. Kernel code may contain conditional operators depending on the index of the work item. Without warp scheduling, different branches of such operators are simultaneously executed by the corresponding processing element performing the work item. Work items can use data from both local unit memory and global device memory, but local memory data is not available to the members of other workgroups running on other units.

The key aspect of auto-tuning are the tuning parameters that affect the performance of an OpenCL program. These parameters usually depend on size – the size of the program’s input data; the number of work items also depends on the size of the data. In this paper, we consider the following tuning parameters:

•

WG — the workgroup size (defined in the host program). With an optimally selected workgroup size, all processing elements of all units are fully and evenly loaded, which clearly leads to an improvement in the total computation time.
•

TS — the tile size (defined in the kernel). The optimal choice of the size of data blocks periodically loaded into the fast local memory minimizes the number of calls to slow global memory, which also leads to faster execution.

There exist practical cases when we may have several tile sizes (e.g., when the computation uses matrices of different size), but different sizes of workgroups are usually not required. Our Promela model for computations chooses tuning parameters in a non-deterministic manner by randomly selecting them in the range depending on the input data size. Obviously, the more non-determinism is in the model (in particular with an increase in the number of tuning parameters), the more resources are necessary to perform the model checking.

Summarizing, our auto-tuning problem is to find the optimal workgroup size WG and tile size TS for the abstract kernel abstract_kernel (Listing 2) and its host program, executed on the Abstract Platform as described in Section 3.1. For this problem, we now describe the particular steps of our Promela-based approach that relies on finding a counterexample, as outlined in Section 2.

Step 1 of the Counterexample Method

The first step is the most time-consuming in our method as it is necessary to take into account all details of executing an abstract parallel OpenCL program on an abstract OpenCL platform. We define the following parallel Promela processes in $Promela$ $Abstract$ $Model$ for modeling this execution:

•

initial process main selects the values for the tuning parameters and starts the host and clock processes;
•

process host activates several processes device;
•

every process device launches its subordinate processes unit;
•

every process unit starts processes implementing its processing elements pex and their local memory barrier;
•

every process pex performs computations for the abstract_kernel;
•

every process barrier locally synchronizes processing elements pex;
•

service process clock implements global time counting.

Fig. 4 shows the sequence and communication diagram for these processes. Pairs of processes form master-slave structures: (host, device), (device, unit), (unit, pex), and (unit, barrier). The master starts (run), activates (go) and stops (stop) its slave processes. Slave processes report about termination (done). The processes communicate through handshake channels. In the diagram, we show only one instance for each process. Barrier synchronization is implemented using messages, by waiting for them to be sent by a predefined number of processes. Time is incremented in process clock after the completion of a computation step by all processing elements. Next, we describe our modeling design step-by-step in detail.

In Listing 3, we show how process main selects the values for tuning parameters WG and TS and starts processes host and clock. The number of devices ND, the number of units NU per device, and the number of processing elements NP per unit are global constants declared at the beginning of the model description, as well as factor GMT for the time of computations using the global memory in process pex. In our model, the number of running Promela processes is made up of one main process, one clock, one host, no more than ND devices, no more than NU*ND units, no more than NU*ND barriers, and no more than NU*ND*NP processing elements. We assume for simplicity that data size size is a power of 2. Therefore, the numbers WG and TS are also some random powers of 2. These powers are chosen in lines 14 and 16. Since Promela does not support exponentiation, randomly selected numbers can be obtained by appropriate bitwise shift of size. The number NWD of devices that should be started by the host (lines 14, 16) and the number NWU of units that should be started by every device (line 18) both depend on the number WGs of workgroups. The number NWE of processing elements that should be started by every unit depend on the size WG of a workgroup (line 20). These assignments minimize the number of simultaneously executing processing elements (line 22). In this process, we specify the current value of computation time T (line 23). In line 24, the model of parallel computations starts.

Listing 3: Promela-process for selecting parameters and launching computation

⬇

1active proctype main() {

2 byte i;

3 byte n = 10;

4 size = 1 << n; // size = 2^n

5 // workgroup size selection

6 select (i : 1 .. n-1);

7 WG = size >> (n - i);

8 // tile size selection

9 select (i : 1 .. n-1);

10 TS = size >> (n - i);

11 // number of working groups

12 WGs = size / (WG * TS);

13 // number of working devices

14 NWD = (WGs <= NU * ND -> (WGs / NU) : ND);

15 // if WGs <= NU

16 NWD = (WGs / NU -> NWD : 1);

17 // number of working units

18 NWU = (WGs <= NU -> WGs : NU);

19 // number of working elements

20 NWE = (WG <= NP -> WG : NP);

21 // total number of working elements

22 allNWE = NWE * NWU * NWD;

23 T = 100;

24 atomic {

25 run host();

26 run clock();

27 }

28}

Listing 4 shows process host that activates NWD processes device in lines 6, 8. If the number of workgroups WGs is relatively small then it waits for their termination (line 11) and marks the termination of the target parallel computations in line 23 by variable FIN. Otherwise, this process reactivates the devices until all workgroups are served (line 17).

Listing 4: Promela-process for the host-program

⬇

1proctype host () {

2 chan d_hst = [0] of {mtype : action};

3 chan hst_d = [0] of {mtype : action};

4 FIN = false;

5 atomic {

6 for(i : 0..NWD - 1) { run device(d_hst, hst_d); }}

7 atomic {

8 for(i : 0..NWD - 1) { hst_d ! go; }}

9 if

10 :: WGs <= (NU * ND) -> atomic {

11 for(i : 0..NWD - 1) { d_hst ? done; hst_d ! stop; }}

12 :: else ->

13 for(i : 0..WGs / (NU * ND) - ND) {

14 atomic {

15 d_hst ? done;

16 allNWE = allNWE + NWE * NWU;

17 hst_d ! go; }}

18 for(i : 0..ND - 1) {

19 atomic {

20 d_hst ? done;

21 hst_d ! stop; }}

22 fi

23 FIN = true;

24}

As an extension of our previous paper [22], we describe here in detail processes device and unit. Process device (Listing 5) starts its NWU instances of process unit in line 4, activate them in line 7 and awaits their termination in lines 11, 15, or 18, depending on the number of workgroups, like the unit process above waits their pexes. If the slave unit does not suppose to be reactivated, then the number of simultaneously working processing elements is decreased (lines 11 and 18). The device also reports about termination to its master – the host process (line 20). After stopping, the device stops all its units (line 22).

Listing 5: Promela-process for a device

⬇

1proctype device (chan d_hst; chan hst_d) {

2 chan dev_u = [0] of {mtype : action};

3 chan u_dev = [0] of {mtype : action};

4 atomic{ for(i : 0..NWU - 1) { run unit(dev_u, u_dev); }}

5 do

6 :: hst_d ? go ->

7 atomic{ for(i : 0..NWU - 1){ dev_u ! go; }}

8 if

9 :: WGs <= NU ->

10 atomic { for(i : 0..NWU - 1) {

11 u_dev ? done; allNWE = allNWE - NWE; }}

12 :: else ->

13 atomic {

14 for(i : 0..WGs - 1) {

15 u_dev ? done; dev_u ! go; }}

16 atomic {

17 for(i : 0..NU - 1) {

18 u_dev ? done; allNWE = allNWE - NWE; }}

19 fi

20 d_hst ! done;

21 :: hst_d ? stop ->

22 atomic{ for(i : 0..NWU - 1) { dev_u ! stop; }}

23 break;

24 od

25}

The process unit (Listing 6) starts its NWE processing elements pex and their local memory barrier (lines 6-8). Then upon receiving the activating signal from its master device, the unit process activate its pexes (line 11) and then waits termination of their work (line 14). If the size of every work group WG is less then the number NP of slave processing elements of the unit, after receiving reports from all activated pexes, the process unit goes to report about the termination to its master device in line 23. If it is not the case, the unit again activates the just finishing processing elements in line 17 until all all work items in the workgroup have been completed. Upon receiving the stop signal from its device, the unit process stops its barrier and all processing elements (lines 25-26).

Listing 6: Promela-process unit for abstract unit

⬇

1proctype unit (chan dev_u; chan u_dev) {

2 chan pex_b = [0] of {mtype : action};

3 chan b_pex = [0] of {mtype : action};

4 chan pex_u = [0] of {mtype : action};

5 chan u_pex = [0] of {mtype : action};

6 run barrier (pex_b, b_pex);

7 atomic { for (i : 0..NWE - 1) {

8 run pex(i, pex_b, b_pex, pex_u, u_pex); }}

9 do

10 :: dev_u ? go ->

11 atomic { for(i : 0..NWE - 1) { u_pex ! go; }}

12 for(i : 0..WG - 1) {

13 atomic {

14 pex_u ? done;

15 if

16 :: WG <= NP -> skip;

17 :: else -> u_pex ! go;

18 fi }}

19 if

20 :: WG <= NP -> skip;

21 :: else -> atomic { for(i : 0..NP - 1) { pex_u ? done; }}

22 fi

23 u_dev ! done;

24 :: dev_u ? stop ->

25 pex_b ! stop;

26 atomic { for (i : 0..NWE - 1) { u_pex ! stop; }}

27 break;

28 od

29}

Listing 7 shows the local synchronization process barrier: it synchronizes processing elements of one unit using the pex_b and b_pex channels. It receives a message from the pex processes about the suspension of their work and counts in the i variable the number of processes that are waiting for the termination of processing the local memory of other members of the group currently performing calculations (line 4). When it is equal to the number of processing elements NWE launched by its master unit, the barrier is considered passed and in line 5 the the barrier process allows the processing elements to continue computations.

Listing 7: Promela-process barrier for local synchronization

⬇

1proctype barrier (chan pex_b; chan b_pex) {

2 do

3 :: pex_b ? done ->

4 for(i : 1..NWE - 1) { pex_b ? done; }

5 atomic { for(i : 0..NWE - 1) { b_pex ! go; }}

6 :: pex_b ? stop -> break;

7 od

Listing 8 shows in detail the work item process pex that performs computations of an instance of the kernel code abstract_kernel presented in Listing 2. The total number of these processes that are created in the Promela model depends on the size of the input data. The number of simultaneous pex processes depends on the number of units, which is significantly smaller than the total number of processes. Each of them is started by its own unit, whose Promela process is described in [24].

Listing 8: Promela-process pex for a processing element

⬇

1inline long_work (gt, tz) {

2 do

3 :: time > (start_time + (gt * tz)) -> break;

4 :: else -> atomic {

5 cur_time = time;

6 NRP_work++;

7 time == cur_time + 1;} //wait

8 od

10proctype pex (byte me; chan pex_b; chan b_pex;

11 chan pex_u; chan u_pex) {

12 do

13 :: u_pex ? go ->

14 atomic { start_time = time; cur_time = time; }

15 for(i : 0..(size / TS - 1)) {

16 long_work(GMT, TS) // access to global memory

17 pex_b ! done; b_pex ? go; // waiting for local co-workers

18 start_time = time;

19 if // ’if’ access to local memory

20 :: me % 2 -> long_work(1, TS)

21 // ’else’ access to local memory

22 :: else -> long_work(1, TS)

23 fi

24 pex_b ! done; b_pex ? go; // waiting for local co-workers

25 }

26 start_time = time;

27 long_work(GMT, 1) // copy the result to global memory

28 pex_u ! done;

29 :: u_pex ? stop -> break;

30 od

31}

Process pex is connected by the handshake synchronization channels pex_b and b_pex with the local group barrier, and pex_u and u_pex with the master unit. In these channels, the work item receives start and stop commands, and also informs about the termination of computations. In our model, we abstract from the specific computations that the kernel performs, so we only use the computation time. We also assume that a computation in local memory takes one time unit, and a computation using global memory takes GMT units of time. This abstraction can be straightforwardly refined for a particular program and hardware.

Thus, in process pex we model only the number of computational steps performed by the kernel (the for loop in line 15), depending on the size of the input data and calls to global memory (lines 16 and 27) and to local memory (lines 19-23). Upon termination of the computation step, pex reports this event to the process that implements global time by increasing the counter of the currently running NRP_work processes (line 6 of inline macro long_work). Note that due to the blocking semantics of the Promela language, the process can proceed to the next stage of its computation only when the global time time is increased by 1 (line 7 of long_work). The ‘‘long work’’ modeling of computations of the abstract kernel function f, g1, g2 and h finishes after gt*tz time units, depending on the type of memory accessed. Synchronization on the local barrier occurs twice, as in the original kernel: line 10 of the kernel (Listing 2) corresponds to line 17 of the model (Listing 8), and line 17 corresponds to line 24. Process pex terminates (line 28) after writing its result into global memory.

Service clock process clock implements global time counting. This process increments the global time counter variable when all running pex processes (their number allNWE) reports by increasing the NRP_work shared variable that currently they are computing. The clock process stops when the global variable FIN is true. The time value at this moment is the time taken for the computation.

Listing 9: Promela-process for global time

⬇

1proctype clock () {

2 do

3 :: FIN -> break;

4 :: allNWE != 0 && NRP_work == allNWE ->

5 atomic { NRP_work = 0; time++; }

6 od

Step 2 of the Counterexample Method

In specializing the over-time property $\Phi_{o}$ for the auto-tuning problem, we use the value of variable FIN that marks the end of calculations, and the final value time. The over-time formula without superindex is used in the model checking approach to general auto-tuning problem, while this formula with superindex ‘ $p$ ’ is specialized for our method which uses Promela model. The SPIN property specification language is temporal logic LTL, such that we can write:

\Phi_{o}^{p}=\textbf{G}(\texttt{FIN}\rightarrow(\texttt{time}>T)),

which corresponds to the statement ‘‘Always when the parallel program terminates, its execution time is greater than $T$ .’’

Step 3 of the Counterexample Method

The third step of our method finds the minimal time for which a parallel computation program can terminate. This step begins with launching the SPIN verifier with the constructed Promela Abstract Model and the formula $\Phi_{o}^{p}$ for some value of time $T$ . We then decrease the value of $T$ until the SPIN stops generating counterexamples, i.e., until it agrees that the program cannot be executed in a time shorter than the final $T$ . In our approach, we find the execution path with minimal model time. This path may not be the shortest one due to many model steps which do not increase the value of variable time. We indeed came across such paths in our experiments. The initial value of $T$ can be found using the simulation mode in SPIN. During simulation, SPIN reproduces one of the finite scenarios of the system behaviour, fixing the values of the variables used in the model at the end of the simulation. Therefore, we can use value time which corresponds to the end of the program in the simulated scenario. To decrease the value of $T$ in the next SPIN runs, we may use the bisection method shown in Fig. 1. The final counterexample gives the minimal model time value.

Step 4 of the Counterexample Method

The final step of our approach is to analyze the last counterexample to extract the optimal configuration of tuning parameters. For counterexample analysis, SPIN provides running a simulation corresponding to the counterexample. In the auto-tuning problem, there is no need to search for the optimal computation path. Therefore, there is not necessary to analyze the transitions of the final counterexample. To solve our problem, we only need to extract the values of the tuning parameters WG and TS, which are known in the final counterexample simulation. The final values of the tuning parameters WG and TS form the optimal parameter configuration we are looking for.

5 Adaptation for Limited Computation Resources: Using Swarms

As our solution requires a large number of interactions through channels, its straightforward implementation does not scale well: with an increase in the size of the input data, the memory required for the verification process in Step 3 begins to grow and may exceed the available physical memory of the computer, which we sometimes observed in our experiments.

We address the problem of limited resources by using the swarm model checking method proposed in [25, 26]. This method uses state hashes and therefore does not visit all possible states of the system. The swarm model checking starts several different verification tests checking the specified property during bounded time and with a bounded number of model steps. Similar to the standard model checking method, it generates counterexamples when one of the paths violates the property. The search of such a path can be parallelized across processor cores and even performed on different network nodes. In [27], we already applied this approach to the automatic solution of various puzzles and practical problems.

In order to use swarm model checking for alternative non-bisection searching, we can formulate the non-termination property $\Phi_{t}$ for the impossibility of program termination as follows: ‘‘A parallel program cannot terminate’’ (LTL formula $\Phi_{t}=\textbf{G}(\neg\texttt{FIN})$ ). We definitely know that our parallel programs that compute some output data must terminate. Hence, we expect that there is at least one counterexample which demonstrates that the program terminates and provides the termination model time. Swarm model checking produces several termination model times from several counterexamples, from which we can choose the minimal termination model time. To be sure that there is no smaller time, we launch successively swarm model checking with a decreasing model time and over-time property $\Phi_{o}^{p}$ in the following manner: we stop searching for the minimal termination model time if swarming with the current model time provides no results within the previous swarm execution time or this swarming finds no smaller times. In other words, the criterion for stopping the search for the optimal time is the ability of the SPIN swarm to find counterexamples, rather than the number of such findings. If the swarm does not find a counterexample as quickly as at the previous swarm launching, the counterexample with a smaller time value does not exist with very high probability. In future work, we plan to formally prove that the configuration of tuning parameters found by the swarm is indeed optimal.

Our algorithm in Figure 5 implements the described search strategy with two functions: Min_time_Swarm(F) and Exe_time_Swarm(F) for the minimal model time found by the swarm for formula F and the execution time of this swarm, respectively. Predicate Swarm(F,exe_time) is true iff the swarming formula F finishes within exe_time. Formula F_t is for the non-termination property, and formula F_o(T) is for the over-time property with time T.

In our implementation, we have developed a script that simulates each counterexample found by the swarm and finds values of time, TS and WG in the output. We also provide a simple program which, after the script has finished, sorts these counterexample results by time values with corresponding tuning parameters TS and WG.

We make an assumption (which corresponds to the situation in practice) that every device and every unit of a device work in exactly the same manner, i.e., it takes exactly the same time (and other resources) to perform computation of the same complexity. Hence, to estimate the computation time, we can abstract from the number of devices and the number of units in these devices and consider just one device and one unit with all its processing elements. Note that we cannot consider just one processing element because, in general, they compute different functions of different complexity depending on their local identifiers while the devices and units synchronize their work independently of the computation complexity. With this abstraction, the number of Promela processes is made up of one main process, one clock, one host, one device, one unit, one barrier and several processing elements. In the model used in our experiments [24], the total number of Promela processes is 10. Here, the only really large parameter is the input data size.

6 Experiments with the Promela Abstract Model

In Table 1 we show the results of a series of experiments conducted with the SPIN model checker (version 6.5.1). During the tests, we studied different sizes of input data for a platform architecture with the following parameters: one device, one unit and four processing elements.

Table 1: Resulting parameter values and memory consumption measured in the experiments

N	Size	Model time	Steps	TS	WG	Memory usage (exhaustive mode)	Memory usage (swarm mode)	Verifi-ca-tion time	1st trail in	1st trail optimality
1	8	44	1658	4	4	0.280GB	-	2s	1s	84%
2	16	156	5673	4	8	2.081GB	-	25s	1s	65%
3	32	584	21011	4	16	14.767GB	115MB	25m/ 1h	7s/ 1s	79%/ 77%
4	64	2224	71495	8	32	-	128MB	1h	1s	83%
5	128	9344	267119	64	64	-	172MB	1h	1s	94%
6	256	36234	1300634	4	4	-	1.145GB	2h	3s	94%
7	512	142090	5099397	4	4	-	2.367GB	2h	3s	99%
8	1024	549912	15973533	32	16	-	16GB	4h	3m	99%

We observe in the table of results that, up to input data size 32, exhaustive verification with over-time formula $\Phi_{o}^{p}$ was possible for our machine (an M1-based MacBook with 16Gb of RAM).

This auto-tuning finds optimal values of parameters with the shortest simulation time (parameter which is also calculated in our model). For reference, we include model steps, shown in SPIN simulation mode for the best counterexample. When the data size is significantly larger, we may hit the 16 GB memory threshold. Therefore, we run the swarm verification with non-termination formula $\Phi_{t}$ on 1-8 cores for given maximum verification time threshold, and the swarm process is now able to find a large number of counterexamples with trails to the final state FIN for larger data sizes. We also developed a runner script to process all the trails found by SPIN. It simulates a particular trail, obtains the simulation time and TS/WG parameters from the simulation output, sorts them by time and required steps. Using such a script finally helps to obtain the minimal time and optimal parameters from a bunch of trails. Getting multiple trails is possible due to -e parameter (create trails for all errors) during the verification and swarm run. We also had to increase maximum search depth depending on the input size with the last one set to $2\times 10^{8}$ by option -m.

In Table 1, ‘‘Model time’’ is the time that is calculated in the Promela model (variable time), and ‘‘Verification time’’ is the time spent by the SPIN verifier or its swarm processes. In the last two columns of the table, we show the time required to obtain the first counterexample with a trail – the quickest sub-optimal solution of the problem (both using exhaustive and swarm methods) and its optimality as the ratio of the optimal solution modeling time to the time of the first counterexample (if the times coincide, we also take into account the number of steps spent).

Therefore, we can conclude that our method is feasible in practice: for small dimensions, we can prove that our interacting system terminates in the correct order. Once we hit the memory limits in the model checker, we can use the swarm method and get the approximate model time and target parameters with a full trace of execution after a series of quick model runs. Of course, this method is not yet applicable for modeling operations with really big data. However, we initially set ourselves goals to consider modeling with an abstraction from operations and data, since we are primarily interested in the sequence of their processing.

7 Application Use Case: Auto-Tuning the Minimum Problem

In this section, we demonstrate how our approach to auto-tuning based on model checking works for a particular application use case. Our use case is the problem of computing the minimal value in a very big integer array, we call it the Minimum problem.

We offer an OpenCL kernel program for the Minimum problem, run and tune parameters for it using Nvidia GPU P104-100 of the Pascal architecture. Then we show how the general Promela model defined in Section 4 can be modified to model a particular case of this parallel computation problem. Ultimately, we apply our method of counterexamples and thereby find the optimal tuning parameters for the Minimum problem.

7.1 OpenCL Program for the Minimum Problem

The Minimum problem belongs to the practically relevant class of computing problems which compute a single result for an input array; they are also known as reductions. Such problems are popular in teaching parallel programming of parallel programs for massively parallel message systems like MPI. On the other hand, they are also well suitable for parallel computing on massively parallel systems like GPU. The usual approach is to first scatter parts of the array over compute units, which then further divide them and compute the minimum in their part, after which it is necessary to collect all parts and calculate the global minimum.

The solution scheme is shown in Figure 6 and presented in Listing 10 for the kernel and Listing 11 for the host. In our implementation, we work on a grid of the total number of processing elements and this number is set from the host program as $global\_item\_size$ (see host, line 10). The current work item number is available from the kernel as $get\_global\_id(0)$ . The total number of processing elements is divisible into workgroups each by $get\_local\_size(0)$ , which is referred to $WG$ in our programs. We can consider all processing elements as a two-dimensional array with coordinates available as $(get\_global\_id(0)/get\_local\_size(0),get\_local\_size(0))$ (see kernel, lines 3 and 4). At the same time, according to the first coordinate, we can assume that some abstraction on smart computing units is done here, where each of them has local memory and $WS$ child processing elements during computations.

In general, each of the processing element receives a portion of data of size $TS$ , while it can be set relative to the splitting of all data into equal parts or relative to the number of such splits and then recalculated into array elements. Then the parallel algorithm finds all minima for each calculator in its portion of data of size $TS$ (kernel, lines 7-9) and saves them in local memory. Further, all found minima are compared started from zero cells of local memory (kernel, lines 12-16) and then unloaded into global memory, where the final reduction is carried out for simplicity on the host (host, lines 19-24). The full program for the minimum problem is available in [24].

Listing 10: The OpenCL kernel for the Minimum problem

⬇

1__kernel void findMinValue(__global int* glob, __global int* mins, __local int* loc, int TS) {

2#define MIN(a, b) if (a > b) a = b;

3 int my_unit = get_global_id(0) / get_local_size(0);

4 int my_elem = get_local_id(0);

5 loc[my_elem] = INFINITY;

6 // MAP

7 for (int i = 0; i < TS; i++) {

8 MIN(loc[my_elem], glob[i + get_global_id(0) * TS])

9 }

10 barrier (CLK_LOCAL_MEM_FENCE);

11 // REDUCE local

12 if (my_elem == 0) {

13 mins[my_unit] = loc[0];

14 for (int i = 1; i < get_local_size(0); i++) {

15 MIN(mins[my_unit], loc[i])

16 }

17 }

18}

Listing 11: The host code for the Minimum problem

⬇

1int main(…) {

2 …

3 size_t units = …; // Should be near the value of SM of the card

4 size_t global_item_size = units * WG; //Total number of calcs

5 size_t local_item_size = WG; //Work group size

6 size_t TS = …; //Tile size

7 size_t data_size = …;// clArray[data_size] is our Big Data

8 …

9 clSetKernelArg(kernel, 0, sizeof(cl_mem), &clArray);

10 clSetKernelArg(kernel, 1, sizeof(cl_mem), &clMins);

11 clSetKernelArg(kernel, 2,

12 sizeof(cl_uint) * local_item_size, NULL);

13 //Run the kernel on the grid (global_item_size x local_item_size)

14 clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_item_size,

15 &local_item_size, 0, NULL, NULL);

16 TS = data_size / (units * WG); //Chunk of data per TS

17 clSetKernelArg(kernel, 3, sizeof(cl_uint), &TS);

18 …

19 clEnqueueReadBuffer(queue, clMins, CL_TRUE, 0, units *

20 sizeof(cl_uint), mins, 0, NULL, NULL);

21 int min = INT32_MAX;

22 // REDUCE global

23 for (int i = 0; i < units; i++) {

24 if (mins[i] < min) min = mins[i]; //Total minimum = min

25 }

26}

Let us now discuss the choice of optimal parameters for running the kernel on particular hardware. We are going to conduct a series of experiments on the GPU P104-100 to find the minimal element in a 4GB array using the proposed kernel, while changing the values of the tuning parameters in nested loops. The key features of our target platform are described in the white paper on the Pascal architecture [28], and its current configuration is presented, for example, in [29].

Our target GPU has a total of 1920 processing elements (or CUDA cores according to Nvidia terminology). Respectively, the total number of elements in the grid of the OpenCL program (how many instances of our kernel will be launched) should be commensurate with this value. We should also pay attention to the number of streaming multiprocessors (SM), which correspond to the compute units in the OpenCL platform; here there are 15 of them. Analyzing their structure, we can see that in general, inside each SM there are 128 processing elements for different purposes, which actually form the given configuration of CUDA cores (1920 = 128 * 15). In addition, there is also 48Kb L1 local memory per each SM. From this we can conclude that in order to effectively execute our kernel on our hardware platform, we need to set the following initial configuration: units = 15; WG = local_size = 128, global_size = 15 * 128. However, it should be noted that this launch configuration is only the initial one: some processing elements may remain idle while waiting for data. The task of the OpenCL compiler and the runtime environment is to efficiently use all available resources for a given kernel and launch configuration, and our task is to select such parameters as a result of several launches, auto-tuning or exemplary models, such as ours in Promela.

7.2 From the Minimum OpenCL Kernel to its Promela Model

In this section, we describe how we can obtain a Promela model for the minimum problem, as the first step of our auto-tuning approach based on model checking with counterexamples. We should adapt all Promela processes described in Section 4 except clock and host processes, as follows below. The full Promela model for the minimum problem is available in [24].

First, we consider the service main process. Before selecting optimization parameters (from line 3, Listing 3), the process loads data to global memory and updates the local memory of all units with maximal values MAX of data type. Arrays glob and loc are defined as global Promela variables.

Listing 12: A part of main Promela-process in the Minimal model for defining initial data

⬇

1active proctype main() {

2…

3 // loading input data to global memory

4 for(i : 0 .. size-1) { glob[i] = size - i; }

5 // preparing local memory

6 for(i : 0 .. LM*NU-1) { loc[i] = MAX; }

7 // workgroup size selection

8…

The host process in case of the Promela model for the minimum problem uses only one device and does not operate on global memory, so we do not have to change it. There is also nothing to change in the service clock process.

The device process must count the number nwg of each workgroup launched on one compute unit and send it to its compute units in order to correctly indexing global memory in processing elements. Hence, we change the type of channel dev_u: instead of just saying ’go’ to units (line 11 in Listing 5), a device computes and sends information about the number of workgroups in lines 9-10, Listing 13.

Listing 13: A part of device Promela-process in the Minimal model for sending workgroup numbers

⬇

1proctype device (chan d_hst; chan hst_d) {

2 chan dev_u = [0] of {byte, mtype : action};

3 …

4 for(i : 0..WGs-1) {

5 atomic { u_dev ? done;

6 dev_u ! nwg, go;

7 num++;

8 if

9 :: num >= NU -> nwg++; num = 0;

10 :: else -> skip;

11 fi }}

12 …

13}

For correct indexing of local memory in the Promela model, the unit process must inform its processing elements about its number mu provided by its device. So we change the corresponding proctype definitions and run commands (line 6, Listing 14). Beside the number of a workgroup received from its device (line 8, Listing 14), the unit must send to each its processing element information about the number iter of launched work items assigned to it within one group computation for correct global indexing (lines 9 and 20, Listing 14). We correspondingly change a type of channel u_pex (line 3, Listing 14). If a processing element does not finish compute assigned work items, the unit launch it again (line 20, Listing 14). After finishing, all processing elements report about it in line 32, Listing 14.

Listing 14: Promela-process unit in the Minimal model

⬇

1proctype unit (byte me, chan dev_u; chan u_dev) {

2 …

3 chan u_pex = [0] of {byte, mtype : action};

4 run barrier (pex_b, b_pex);

5 atomic { for(i : 0..NWE-1) {

6 run pex(i, me, pex_b, b_pex, pex_u, u_pex); }}

7 do

8 :: dev_u ? nwg, go ->

9 atomic { for(i : 0..NWE-1) { u_pex ! nwg, gowg; u_pex ! 0, go; }}

10 if

11 :: WG <= NP -> atomic { for (i : 0..NWE-1) { pex_u ? ewg; }}

12 :: else ->

13 iter = 1;

14 num = 0;

15 for(i : 0..WG-NP-1) {

16 atomic {

17 pex_u ? status;

18 if

19 :: status == end_wg -> skip;

20 :: else -> u_pex ! iter, go;

21 fi

22 num++;

23 if

24 :: num >= NP -> iter++; num = 0;

25 :: else -> skip;

26 fi }}

27 for(i : 0..NWE-1) { pex_u ? ewg; }

28 fi

29 u_dev ! done;

30 :: dev_u ? 0, stop ->

31 atomic { for(i : 0..NWE-1) { u_pex ! 0, stop; }}

32 atomic { for(i : 0..NWE-1) { pex_u ? done; }}

33 pex_b ! stop;

34 break;

35 od

36}

There is exactly one local memory barrier at the end of computation of every unit. This barrier works for one processing element only (the first element of the unit), so, in line 5 of barrier process (Listing 7) we replace the collective launching by individual launching by removing the for loop.

The processing element pex uses local information about its local id and its unit id from its proctype definition and information received from its unit about a current workgroup number nwg and iteration iter (lines 6, 8, Listing 15) in order to index and compare local and global memory data and save the minimum in the corresponding slot of the local memory (line 15, Listing 15). Comparing with abstract pex, at this stage of the minimum problem the process does not care about local computations of its co-workers, so it does not need a barrier. Every processing element pex must inform its unit about the end of processing assigned work items by sending the message in line 18, Listing 15. When the processing element with zero local id have received from its master unit information about the end of computation (line 9, Listing 15), it reduces the collective result from the local memory (lines 14-16, Listing 15) and saves it to the first slot of global memory (line 17, Listing 15).

Listing 15: Promela-process pex in the Minimal model

⬇

1proctype pex (byte me; byte mu; chan pex_b; chan b_pex; chan pex_u; chan u_pex) {

2…

3byte myloc = me + mu*NP; // my place in local memory

5do

6:: u_pex ? nwg, gowg ->

7 do

8 :: u_pex ? iter, go ->

9 atomic { start_time = time;

10 cur_time = time; }

11 glob_id = (WG > NP -> nwg*WG + me + iter*NP : nwg*WG + me);

12 shift = glob_id * TS;

13 // MAP

14 for(i : 0 .. TS-1) {

15 min(loc[myloc], glob[i + shift])

16 long_work(GMT) } // access to global memory

17 if

18 :: (iter+1) >= WG/NP -> pex_u ! end_wg; break;

19 :: else -> pex_u ! n_end_wg;

20 fi

21 od

22:: u_pex ? 0, stop ->

23 pex_b ! done;

24 if

25 :: me == 0 ->

26 b_pex ? go; // waiting for local co-workers

27 // REDUCE local

28 for(i : 1..(NWE-1)) {

29 min(loc[myloc], loc[myloc + i])

30 time++; } // access to local memory

31 // copy the result of this working group to global memory

32 min(glob[0], loc[myloc])

33 time = time + GMT;

34 :: else -> skip;

35 fi

36 pex_u ! done;

37 break;

38od

39}

We can obviously use the outlined adaptation technique for a broad variety of computing problems based on reduction, with different base operations (which in our use case is the binary min operation).

7.3 The experimental results for the Minimum Problem

In this section, we present the experimental results for two possible ways of finding optimal configuration of tuning parameters for the use case of Minimum problem. The first way is finding the optimal parameter values manually by running the parallel implementation on the particular architecture (in our case — the Nvidia P104-100 GPU) for different combinations of parameter values and comparing the run times, i.e., this is the way still often used by application programmers in practice. The second way of finding the optimal configuration of the tuning parameters is applying our approach presented in this paper: using the Promela model of the OpenCL code and looking for a counterexample by means of the model checking tool SPIN.

In the manual approach, a series of tests were conducted to find the minimum for the same data size, but with different kernel startup parameters. The time and speed of processing data of a given length were considered when the clEnqueueNDRangeKernel (kernel launch) and clEnqueueReadBuffer (getting an array with minima for reduction on the host) functions worked on previously prepared data.

Table 2: Experiments with the Minimum kernel

N	Global size	WG	TS	Time, ms	Bandwidth, Gb/s
1	960	64	64	140	28.40
2	960	64	128	140	28.40
3	960	64	256	140	28.40
4	1280	64	128	156	25.60
5	1280	64	256	156	25.60
6	1600	64	64	156	25.61
7	1920	128	64	137	29.18
8	2560	128	64	140	28.46
9	3200	128	64	124	32
10	5120	512	64	124	32
11	6400	256	64	109	36,57
12	7680	512	64	93	42.67

Table 2 shows our experimental results. We observe that, in all data series, an increase in TS by 2 or 4 times does not lead to a change in the array processing speed. At the same time, increasing the size of the WG (which generally increases the overall number of the calculators) increases performance. This can be explained by the fact that, with an increase in the WG size, a larger amount of local instead of global memory is used by Streaming Processors for internal reductions. In addition, since our use case is not computationally complex, some processing elements can be idle at a time, and if there are more tasks, the thread schedulers in the compute units can load them more efficiently.

We also perform experiments with the implementation model described in this section for calculating the minimum on Promela using our counterexample method. We proceed similarly to the approach in Section 5, i.e., we find a counterexample of the logical requirement that a successful calculation cannot be reached is a particular program trace. Using this trace, we can find the optimal parameter values for our model.

Table 3: Experiments with the Minimum Promela model

N	Processing elements	Data size	WG	TS	Model time	Steps
1	4	16	8	2	20	1347
2	4	16	4	4	24	1229
3	4	16	2	4	25	1365
4	64	64	16	4	36	4398
5	64	64	8	8	44	4101
6	64	64	4	4	75	4101
7	64	128	8	16	76	7303
8	64	128	4	16	137	7493
9	64	128	4	8	139	7781
10	64	256	4	8	271	15024
11	64	256	4	4	279	16151
12	64	256	2	4	295	18491

Table 3 shows the results with Promela. We observe that the WG parameter, as well as in a full-scale experiment, affects the run time more dramatically than TS parameter. In the solutions with the best time for a given data size, there were always the largest value of this parameter among other solutions found by the model checker due to more intensive using of local memory as in real experiments.

Hence, we observe that our Promela-based approach to auto-tuning the Minimum problem implementation in OpenCL demonstrates the same time behaviour of the program performance depending on the tuning parameters WG and TS as the manual approach to finding the optimal parameter configuration. In our future work, relying on these positive experimental results, we will address scaling up our formal approach to much larger sizes of input data.

8 Conclusion

The long-term goal of our research is the development of rigorous, formally-based methods for auto-tuning high-performance parallel programs by relying on the model checking approach, in particular using the technique of counterexample-guided search. The results of auto-tuning are the optimal values of tuning parameters that ensure the best possible performance of the given program on a particular parallel architecture for a particular input data size. As our current target, we address high-performance programs written in OpenCL which is an emerging open standard for programming various kinds of parallel architectures, including multi-core CPU, GPU, and FPGA. Our approach described in this paper is based on developing an abstract model of OpenCL computations and then expressing the model of program execution and optimality properties in the formal language Promela, such this model can be investigated by the popular model checking tool SPIN.

The main advantage of our logic-based approach to auto-tuning high-performance programs is that is can be applied without involving the particular target execution platform. Other than the existing auto-tuners, our simulation of the execution of OpenCL programs in Promela abstracts from specific computations, keeping the logic of interaction and synchronization of parallel program processes in the target parallel architecture. Due to the formal semantics of the Promela language, the model of a given OpenCL program can be viewed as the formal operational semantics of interaction and synchronization of parallel program processes on a selected platform and device architecture. By varying the parameters and algorithms of interaction between the components of an abstract device, it is possible to define and study specific processor architectures. This makes it possible to search for optimal program settings in the absence of real target processors, while existing auto-tuning frameworks cannot provide such an opportunity.

We plan to further improve our approach by customizing our general auto-tuning method for a specific application program executed on a specific hardware, in addition to computing the minimum on the Nvidia P104-100 processor. In particular, we are going to conduct a case study with matrix multiplication on the most recent Ampere GPU architecture by Nvidia Corp. Knowing the particular specification, for example of the Ampere architecture, we can customize the ratio of the global memory access time to the local memory access time, and the performance time for every function used in the program by considering all primitive operations in the function. We will also take into account specific organisation of communication in the architecture, for example between Ampere units and their processing elements.

In our future work, we will also add the communication time between processes to the model, as well as consider other settings, in particular, the number of work items. In addition, to improve our counterexample method, we are going to try using the never claim pruning from [30], and finding optimal values from [11], which avoids multiple runs of the SPIN verifier in the counterexample method. We also plan to improve the scalability of our counterexample method by modeling a warp-based scheduling [31], because due to warping, the number of simultaneously working processing elements becomes smaller, which implies reduction of interleaving non-determinism in the model. Finally, we will work on scaling up our approach to much larger sizes of input data.

Acknowledgements

We are grateful to the anonymous referees of the LOPSTR’22 symposium for their great help in improving the initial, shorter version [22] of this paper.

References

[1] Hoos HH. Automated algorithm configuration and parameter tuning. In: Autonomous search, pp. 37–71. Springer, 2011.
[2] Ansel J, Kamil S, Veeramachaneni K, et al. OpenTuner: An extensible framework for program autotuning. In: Proc. of 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2017 pp. 385–392.
[3] Christen M, Schenk O, Burkhart H. PATUS: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. In: Proc. of 2011 IEEE International Parallel Distributed Processing Symposium. IEEE, 2011 pp. 385–392.
[4] Clint W, Dongarra J. Automatically tuned linear algebra software. In: Proc. of the ACM/IEEE Conference on Supercomputing. IEEE, 1998 pp. 385–392.
[5] Frigo M, Johnson SG. The design and implementation of FFTW3. IEEE, 2005. 93(2):187–195.
[6] Nugteren C, Codreanu V. CLTUne: A generic auto-tuner for OpenCL kernels. In: Proc. of 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip. IEEE, 2015 pp. 385–392.
[7] Rasch A, Gorlatch S. ATF: A generic directive-based auto-tuning framework. Concurrency and Computation: Practice and Experience, 2019. 31(5):e4423. https://doi.org/10.1002/cpe.4423.
[8] Brinksma E, Mader A, Fehnker A. Verification and optimization of a PLC control schedule. Int. J. on Software Tools for Technology Transfer, 2002. 4:21–33.
[9] Malik R, Pena P. Optimal Task Scheduling in a Flexible Manufacturing System using Model Checking. IFAC-PapersOnLine, 2018. 51(7):230–235.
[10] Ruys TC, Brinksma E. Experience with Literate Programming in the Modelling and Validation of Systems. In: Proc. of the 4th Int. Conf. on Tools and Algorithms for the Construction and Analysis of Systems (TACAS’98). 1998 pp. 393–408.
[11] Ruys T. Optimal Scheduling Using Branch and Bound with SPIN 4.0. In: Proc. of Model Checking Software. SPIN 2003. Springer, 2003 pp. 1–17.
[12] Wijs A, Pol JVD, Bortnik EM. Solving scheduling problems by untimed model checking: The clinical chemical analyser case study. International Journal on Software Tools for Technology Transfer, 2016. 11(5):375–392.
[13] Cimatti A, Edelkamp S, Fox M, Magazzeni D, Plaku E. Automated Planning and Model Checking (Seminar 14482). Dagstuhl Reports, 2015. 4(11):227–245. 10.4230/DagRep.4.11.227.
[14] Clarke EM, Henzinger TA, Veith H. Handbook of Model Checking. Springer International Publishing, 2018.
[15] Lazreg S, Cordy M, Collet P, Heymans P, Mosser S. Multifaceted automated analyses for variability-intensive embedded systems. In: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 2019 pp. 854–865.
[16] Khronos. OpenCL Specification by Khronos OpenCL Working Group. 2021., 2021. URL https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL\_API.html.
[17] Holzmann GJ. The SPIN Model Checker: Primer and Reference Manual. Addison-Wesley Professional, 2003.
[18] Nimble-code. Promela language grammar, 2021. URL http://spinroot.com/spin/Man/grammar.html.
[19] Hoare CAR. Communicating sequential processes. Prentice-Hall, 1985.
[20] Gaspari M, G Z. An Algebra of Actors. In: Proc. of Formal Methods for Open Object-Based Distributed Systems. FMOODS 1999. Springer, 1999 pp. 385–392.
[21] Garanina NO, Gorlatch SP. Autotuning parallel programs by model checking. Modeling and analysis of information systems, 2021, in Russian. 28(4):338–355.
[22] Garanina N, Staroletov S, Gorlatch S. Model Checking Meets Auto-Tuning of High-Performance Programs. In: Logic-Based Program Synthesis and Transformation: 32nd International Symposium, LOPSTR 2022, 2022, Proceedings. LNCS, Springer, 2022 pp. 63–82.
[23] Subhlok J, Stichnoth JM, O’hallaron DR, Gross T. Exploiting task and data parallelism on a multicomputer. In: Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming. 1993 pp. 13–22.
[24] Garanina N, Staroletov S. Discussed models in Promela, 2023. URL https://github.com/SergeyStaroletov/PromelaSamples/tree/master/autotune_opencl.
[25] Holzmann GJ, Joshi R, Groce A. Tackling large verification problems with the swarm tool. In: International SPIN Workshop on Model Checking of Software. Springer, 2008 pp. 134–143.
[26] Holzmann GJ, Joshi R, Groce A. Swarm verification techniques. IEEE Transactions on Software Engineering, 2010. 37(6):845–857.
[27] Staroletov S. Model checking games and a genome sequence search. In: Journal of Physics: Conference Series, volume 1679. IOP Publishing, 2020 p. 032020.
[28] NVidia corp. NVIDIA Tesla P100. Whitepaper, 2016. URL https://www.techpowerup.com/gpu-specs/docs/nvidia-gp100-architecture.pdf.
[29] TechPowerUp. GPU Database. P104-100 Specs, 2017. URL https://www.techpowerup.com/gpu-specs/nvidia-p104-100-8-gb.b8158.
[30] Panizo L, Salmerón A, Gallardo MdM, Merino P. Guided test case generation for mobile apps in the TRIANGLE project: work in progress. In: Proceedings of the 24th ACM SIGSOFT International SPIN Symposium on Model Checking of Software. 2017 pp. 192–195.
[31] Lashgar A, Baniasadi A, Khonsari A. Warp size impact in GPUs: large or small? In: Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units. 2013 pp. 146–152.

Auto-Tuning High-Performance Programs Using Model Checking in Promela

Abstract

keywords:

1 Motivation and Related Work

2 Model Checking for Auto-Tuning: The Idea of Our Approach

3 Our Modeling Approach: From OpenCL to Promela

3.1 An Abstract OpenCL Platform Model

3.2 An Abstract OpenCL Program

4 Using SPIN for Auto-Tuning OpenCL Programs

5 Adaptation for Limited Computation Resources: Using Swarms

6 Experiments with the Promela Abstract Model

7 Application Use Case: Auto-Tuning the Minimum Problem

7.1 OpenCL Program for the Minimum Problem

7.2 From the Minimum OpenCL Kernel to its Promela Model

7.3 The experimental results for the Minimum Problem

8 Conclusion

Acknowledgements

References

Auto-Tuning High-Performance Programs
Using Model Checking in Promela