
Concurrent CPU-GPU Task Programming using Modern C++

Tsung-Wei Huang, Department of Electrical and Computer Engineering, University of Utah; Yibo Lin, Department of Computer Science, Peking University
Abstract

In this paper, we introduce Heteroflow, a new C++ library to help developers quickly write parallel CPU-GPU programs using task dependency graphs. Heteroflow leverages the power of modern C++ and task-based approaches to enable efficient implementations of heterogeneous decomposition strategies. Our new CPU-GPU programming model allows users to express a problem in a way that adapts to effective separation of concerns and expertise encapsulation. Compared with existing libraries, Heteroflow is more cost-efficient in performance scaling, programming productivity, and solution generality. We have evaluated Heteroflow on two real applications in VLSI design automation and demonstrated performance scalability across different numbers of CPUs and GPUs and different problem sizes. In a representative VLSI timing analysis example with million-scale tasking, Heteroflow achieved a 7.7× runtime speed-up (99 vs. 13 minutes) over a baseline on a machine with 40 CPU cores and 4 GPUs.

I Introduction

Modern parallel applications in machine learning, data analytics, and scientific computing typically consist of a heterogeneous use of both central processing units (CPUs) and graphics processing units (GPUs) [Vetter_18_01]. Writing a parallel CPU-GPU program is never an easy job, since CPUs and GPUs have fundamentally different architectures and programming logic. To address this challenge, the parallel computing community has investigated many programming libraries to assist developers with quick access to massively parallel and heterogeneous computing resources using minimal programming effort [CUDA, OpenCL, hiCUDA, Ompss, OpenMPC, OpenACC, StarPU, SYCL, HPX, PaRSEC]. In particular, hybrid multi-CPU multi-GPU systems are driving high demand for new heterogeneous programming techniques in support of more efficient CPU-GPU collaborative computing [Mittal_15_01]. However, related research remains nascent, especially on the front of leveraging modern C++ to achieve new programming productivity and performance scalability that were previously out of reach [Huang_19_01].

The Heteroflow project addresses a long-standing question: “how can we make it easier for C++ developers to write efficient CPU-GPU parallel programs?” For many C++ developers, achieving high performance on a hybrid CPU-GPU system can be tedious. Programmers have to overcome complexities arising out of concurrency controls, kernel offloading, scheduling, and load-balancing before diving into the real implementation of a heterogeneous decomposition algorithm. Heteroflow adopts a new task-based programming model using modern C++ to address this challenge. Consider the canonical saxpy (A·X plus Y) example in Figure 1. Each Heteroflow task belongs to one of host, pull, push, and kernel tasks; a host task runs a callable object on any CPU core (“the host”), a pull task copies data from the host to a GPU (“the device”), a push task copies data from a GPU to the host, and a kernel task offloads computation to a GPU. Figure 1 explains the saxpy task graph in Heteroflow’s graph language.


Figure 1: A saxpy (“single-precision A·X plus Y”) task graph using two host tasks to create two data vectors, two pull tasks to send data to a GPU, a kernel task to offload the saxpy computation to the GPU, and two push tasks to push data from the GPU to the host.
__global__ void saxpy(int n, int a, int *x, int *y){
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) y[i] = a*x[i] + y[i];
}
const int N = 65536;
vector<int> x, y;
hf::Executor executor;
hf::Heteroflow G;
auto host_x = G.host([&](){ x.resize(N, 1); });
auto host_y = G.host([&](){ y.resize(N, 2); });
auto pull_x = G.pull(x);
auto pull_y = G.pull(y);
auto kernel = G.kernel(saxpy, N, 2, pull_x, pull_y)
.block_x(256)
.grid_x((N+255)/256);
auto push_x = G.push(pull_x, x);
auto push_y = G.push(pull_y, y);
host_x.precede(pull_x);
host_y.precede(pull_y);
kernel.precede(push_x, push_y)
.succeed(pull_x, pull_y);
auto future = executor.run(G);
Listing 1: Heteroflow code of Figure 1.

Listing 1 shows the Heteroflow code that implements the saxpy task graph in Figure 1. The code explains itself. The program creates a task dependency graph of two host tasks, two pull tasks, one kernel task, and two push tasks. The kernel task binds to a saxpy kernel written in CUDA [CUDA]. The dependency links form constraints that conform to Figure 1. Heteroflow provides an executor interface to perform automatic parallelization of a task graph scalable to manycore CPUs and GPUs. There is no explicit thread management or fine-grained concurrency control in the code. Our design principle is to let users write simple, expressive, and transparent parallel code. Heteroflow explores a minimum set of core routines that are sufficient for users to implement a broad set of heterogeneous computing algorithms. Our task application programming interface (API) is not only flexible on the user front, but also extensible with the future evolution of C++ standards and heterogeneous architectures. We summarize our contributions as follows:

  • Programming model. We develop a new parallel CPU-GPU programming model to assist developers with efficient access to heterogeneous computing resources. Our programming model allows users to express a problem with effective separation of concerns and expertise encapsulation. Developers can work at a suitable level of granularity for writing scalable applications that is commensurate with their domain knowledge.

  • Transparency. Heteroflow is transparent. Developers need not deal with standard concurrency mechanisms such as threads and fine-grained concurrency controls, which are often tedious and hard to program correctly. Instead, our system runtime abstracts these problems from developers and tackles many of the hardest parallel and heterogeneous computing details, notably resource allocation, CPU-GPU co-scheduling, kernel offloading, etc.

  • Expressiveness. We leverage modern C++ to design an expressive API that empowers users with explicit graph construction and refinement to fully exploit task parallelism in their applications. The expressive power also lets developers accomplish a large amount of work without writing much code. Our user experiences lead us to believe that, although it requires some effort to learn, most C++ programmers can master our APIs and apply Heteroflow to their jobs in just a few hours.

We have applied Heteroflow to two real applications, timing analysis and cell placement, in large-scale circuit design automation and demonstrated the performance scalability across different numbers of CPUs, GPUs, and problem sizes. We believe Heteroflow stands out as a unique tasking library considering the ensemble of software tradeoffs and architecture decisions we have made. With that being said, different programming libraries and frameworks have their pros and cons, and deserve a particular reason to exist. Heteroflow aims for a higher-level alternative in modern C++ domain.

II Motivation

Heteroflow is motivated by our research projects to develop efficient computer-aided design (CAD) tools for very large scale integration (VLSI) design automation. CAD has been an immensely successful field in assisting designers in implementing VLSI circuits with billions of transistors. It was at the forefront of computing around 1980 and has fostered many prominent problems and algorithms in computer science. Figure 2 demonstrates a conventional VLSI CAD flow with a highlight on physical design. Due to the ever-increasing design complexity, the CAD community is driving the need for hybrid CPU-GPU computing to keep tool performance in step with technology scaling [Stok_14_01, Lu_18_01].


Figure 2: A typical VLSI design automation flow with a highlight on the physical design stage. Heteroflow is motivated to address the ever-increasing computational need of modern CAD tools.

II-A Challenge 1: Vast and Complex Dependencies

Computational problems in CAD are extremely complex and pose many challenges that normal software development does not have. The biggest challenge in developing parallel CAD tools is the vast and complex task dependencies. Before evaluating an algorithm, a large amount of logical and physical information must be available first. These quantities are often dependent on each other and expensive to compute. The resulting task dependency graph in terms of encapsulated function calls can be very large. For example, a million-gate design can produce a graph of billions of tasks and dependencies that takes several days to accomplish [Huang_19_01]. However, such difficulty does not prevent CAD tools from parallelization, but highlights the need for new tasking frameworks to implement efficient parallel decomposition strategies, especially with CPU-GPU collaborative computing [Lu_18_01].

II-B Challenge 2: Extensive Domain Knowledge

Developing a parallel CAD algorithm requires deep and broad domain knowledge across circuits, modeling, and programming to fully exploit parallelism. The compute pattern is highly irregular and unbalanced, requiring very strategic collaboration between CPU and GPU. Developers often need direct access to native GPU programming libraries such as CUDA and OpenCL to handcraft the kernels with problem-specific knowledge [CUDA, OpenCL]. Existing frameworks that provide high-level abstraction over kernel programming always come with restricted applicability, preventing CAD engineers from using many new powerful features of the native libraries. Our domain experience leads us to conclude that, despite nontrivial GPU kernels, what makes concurrent CPU-GPU programming an enormous challenge is the vast and complex surrounding tasks, most notably the resource controls on multi-GPU cards, CPU-GPU co-scheduling, tasking, and synchronization.

II-C Need for a New CPU-GPU Programming Solution

Unfortunately, most parallel CPU-GPU programming solutions in CAD tools are hard-coded [Stok_14_01, Lu_18_01]. Developers act as “heroic programmers,” handcrafting every detail of a heterogeneous decomposition algorithm and explicitly deciding which part of the application runs on which CPU or GPU. While the performance is acceptable, it is too expensive to maintain the codebase and scale to new hardware architectures. Some recent solutions adopted directive-driven models such as OpenMP GPU and OpenACC, particularly for data-intensive algorithms [OpenMPC, OpenACC]. However, these approaches cannot handle dynamic workloads, since compilers have limited knowledge to annotate runtime task parallelism and dynamic dependencies. In fact, frameworks at the functional level are more favorable due to their flexibility in runtime controls and on-demand tasking. Nevertheless, most libraries on this front are disadvantageous from an ease-of-programming standpoint [Mittal_15_01]. Users often need to sort out many distinct notations and library details before implementing a heterogeneous algorithm [Beri_16_01]. Also, a lack of support for modern C++ largely inhibits programming productivity and performance scalability [Huang_19_01, Huang_22_01]. After many years of research, we and our industry partners conclude that the biggest hurdle to harnessing the power of collaborative CPU-GPU computing is the lack of a suitable task programming library. Whichever model is used, understanding the structure of an application is critical. Developers must explicitly consider possible data or task parallelism of their problems and leverage domain-specific knowledge to design effective decomposition strategies for parallelization. At the same time, the library runtime removes the burden of low-level jobs from developers to improve programming productivity and transparent scalability. To this end, our goal is to address these challenges and develop a general-purpose tasking interface for concurrent CPU-GPU programming.

III Heteroflow

In this section, we discuss the programming model and runtime of Heteroflow. We will cover important technical details that support the software architecture of Heteroflow.

Heteroflow aims to help C++ developers quickly write CPU-GPU parallel programs and implement efficient heterogeneous decomposition strategies using task-based models.

— Heteroflow’s Project Mantra

III-A Create a Task Dependency Graph

Heteroflow is object-oriented. Users can create multiple task dependency graph objects each representing a unique parallel decomposition in an application. A task dependency graph is a directed acyclic graph (DAG) with nodes and edges representing tasks and dependency constraints, respectively. Each task belongs to one of the four categories: host, pull, push, and kernel.

III-A1 Host Task

A host task is associated with a callable object, which can be a function object, binding expression, functor, or lambda expression. The callable is invoked at runtime by a CPU thread to run on a CPU core. Listing 2 gives an example of creating a host task. In most applications, the callable is described as a C++ lambda to construct a closure inline in the source code. This property allows a host task to enable efficient lazy evaluation and to capture any data, whether it is declared in a local block or in the global scope, largely facilitating ease of programming.

hf::Heteroflow hf;
auto host = hf.host(
[] () { cout << "task runs on a CPU core"; }
);
Listing 2: Creates a host task.

Each time users create a task, the heteroflow object adds a node to its task graph and returns a task handle to users. A task handle is a lightweight class object that wraps a pointer to a graph node. The purpose of this extra layer is to provide an extensible mechanism for users to modify task attributes and, most importantly, to prevent users from directly accessing the internal graph storage, which can easily introduce undefined behaviors. Each node has a general-purpose polymorphic function wrapper to store and invoke different callables according to a task type. A task handle can be empty, often used as a placeholder when it is not associated with a graph node. This is particularly useful when a task's content cannot be decided until a certain point during the program execution, while the task storage needs preallocation at programming time. These properties are applicable to all task types.
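As a minimal illustration of the placeholder idea (the default construction of hf::HostTask below is an assumption for this sketch, not verbatim Heteroflow API), an empty handle can be declared first and bound to a graph node later:

hf::Heteroflow hf;
hf::HostTask later;                   // assumed: an empty handle, not yet bound to a node
// ... later, once the task content is known at runtime ...
later = hf.host([](){ /* callable decided during program execution */ });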

III-A2 Pull Task

A pull task lets users pull data from the host to the device. The exact GPU to perform this memory operation is decided by the scheduler at runtime. Developers should think separately about which part of their applications runs on which space, and decompose them with explicit task construction. Since most GPU memory operations are expensive compared to their CPU counterparts, Heteroflow splits the execution of a GPU workload into three operations, host-to-device (H2D) input transfers, launch of a kernel, and device-to-host (D2H) output transfers, to enable more task overlaps. The pull task adopts this strategy to help users manage the tedious details in H2D data transfers. At the same time, it presents an effective abstraction that the scheduler can take advantage of to perform various optimizations such as automatic GPU mapping, streaming, and memory pooling.

vector<int> data1(100);
float* data2 = new float[10];
auto pull1 = hf.pull(data1);
auto pull2 = hf.pull(data2, 10);
Listing 3: Creates two pull tasks.

Listing 3 gives an example of creating two pull tasks to transfer data from the host to the device. The first pull task operates on a C++ vector of integer numbers and the second pull task operates on a raw data block of real numbers. Heteroflow employs the C++20 span syntax to implement the pull interface. The arguments forwarded to the pull method must conform to the constructor of std::span. In fact, we have investigated many possible data representations and decided to use span because of its lightweight abstraction for describing a contiguous sequence of objects. A span can easily convert to a C-style raw data view that is acceptable to most GPU programming libraries [CUDA, OpenCL, OpenGL]. Sticking with the C++ standard also keeps the core of Heteroflow portable and minimizes the rate of change required for our data interface.
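As a standalone illustration (plain C++20, independent of Heteroflow), both argument forms used in Listing 3 map naturally onto std::span construction over contiguous memory:

#include <span>
#include <vector>

std::vector<int> data1(100);
float* data2 = new float[10];

std::span<int>   s1{data1};       // from a contiguous container
std::span<float> s2{data2, 10};   // from a raw pointer and an element count

auto* p1 = s1.data();             // C-style view usable by GPU libraries
auto  n2 = s2.size_bytes();       // size in bytes for memory transfers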

1template <typename... ArgsT>
2auto PullTask::pull(ArgsT&&... args) {
3 get_node_handle().work = [
4 t=StatefulTuple(forward<ArgsT>(args)...)
5 ] (Allocator& a, cudaStream_t s) mutable {
6 auto h_span = make_span_from_tuple(t);
7 auto h_data = h_span.data();
8 auto h_size = h_span.size_bytes();
9 auto d_data = a.allocate(h_size);
10 cudaMemcpyAsync(
11 d_data, h_data, h_size, H2D, s
12 );
13 };
14 return *this;
15}
Listing 4: Implementation details of the pull task.

Listing 4 highlights the core implementation of the pull task based on CUDA (while the current implementation is based on CUDA, our task interface can accept other GPU programming libraries [OpenCL]). To be concise, we omit details such as error checking and auxiliary functions. The pull task forms a closure that captures the arguments in a custom tuple by which we enable stateful task execution (line 4). For instance, in Listing 1, the change made by the host task host_x on the data vectors must be visible to the pull task pull_x. The stateful tuple wraps references in objects to keep state transitions consistent between dependent tasks. Maintaining stateful transitions is a backbone of Heteroflow. Developers can carry out fine-grained concurrency through decomposition and enforce dependency constraints to keep the logical relationship between task data. In terms of arguments, the runtime passes a memory allocator and a CUDA stream to the closure (line 5). The allocator is a pooled resource for reducing GPU memory allocation overhead, and the CUDA stream is a sequenced mechanism for interleaving GPU operations [CUDA]. A key motivation behind this design is to support multi-GPU computing. Both the memory allocator and the stream are specific to a GPU context, which is decided by the scheduler at runtime. Finally, we create a span from the stateful tuple and enqueue the data transfer operation to the stream (line 6:12).
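To make the "stateful" capture concrete, the following self-contained sketch (our own simplification; StatefulTuple's real definition may differ) stores lvalue arguments by reference so that changes made by preceding host tasks remain visible when the closure eventually runs:

#include <tuple>
#include <type_traits>
#include <utility>
#include <vector>

// Keep lvalues as references, copy rvalues by value.
template <typename T>
using stateful_t = std::conditional_t<
  std::is_lvalue_reference<T>::value, T, std::decay_t<T>>;

template <typename... Ts>
auto make_stateful_tuple(Ts&&... ts) {
  return std::tuple<stateful_t<Ts>...>(std::forward<Ts>(ts)...);
}

std::vector<int> x;                  // empty at graph-construction time
auto t = make_stateful_tuple(x);     // stores std::vector<int>&
x.resize(65536, 1);                  // e.g., done later by a preceding host task
// When the pull closure executes, std::get<0>(t) sees the resized vector.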

III-A3 Push Task

A push task lets users push data associated with a pull task from the device to the host. The code snippet in Listing 5 creates two push tasks that operate on the pull tasks in Listing 3. The arguments consist of two parts, a source pull task holding the device data and the rest to construct a std::span object for the target. Similar to Listing 3, the first push task operates on an integer vector and the second push task operates on a raw data block of floating-point numbers. A push task is stateful. Any runtime change to the arguments that were used to construct a push task will be reflected in its execution context. This property allows users to create stateful Heteroflow graphs for efficient data management between concurrent CPU and GPU tasks.

auto push1 = hf.push(pull1, data1);
auto push2 = hf.push(pull2, data2, 10);
Listing 5: Creates two push tasks from the two pull tasks in Listing 3.
1template <typename... ArgsT>
2auto PushTask::push(PullTask p, ArgsT&&... args){
3 get_node_handle().work = [
4 src=p,
5 t=StatefulTuple(forward<ArgsT>(args)...)
6 ] (cudaStream_t s) mutable {
7 auto h_span = make_span_from_tuple(t);
8 auto h_data = h_span.data();
9 auto h_size = h_span.size_bytes();
10 auto d_data = src.device_data();
11 cudaMemcpyAsync(
12 h_data, d_data, h_size, D2H, s
13 );
14 };
15 return *this;
16}
Listing 6: Implementation details of the push task.

Listing 6 highlights the core implementation of the push task based on CUDA. The push task captures the argument list in the same way as the pull task to form a stateful closure (line 5). The execution context creates a span from the target and extracts the device data from the source pull task (line 7:10). Finally, we enqueue the data transfer operation to a CUDA stream passed by the scheduler at runtime (line 11:13). This CUDA stream is guaranteed to live in the same GPU context as the source pull task. In short, Heteroflow uses pull tasks and push tasks to perform H2D and D2H data transfers. Users explicitly specify the data to transfer between CPU and GPU, and encode these tasks in a graph to exploit task parallelism. They never worry about the underlying details of resource allocation and GPU placement.

III-A4 Kernel Task

A kernel task offloads computation from the host to the device. Heteroflow empowers users with explicit kernel programming using native CUDA toolkits. We do not attempt to develop yet another C++ kernel programming framework, which often comes with restricted applicability and limited performance portability. Instead, users leverage their domain knowledge with the highest degree of freedom to implement their kernel algorithms, while leaving task parallelism to Heteroflow. Listing 7 gives an example of creating two kernel tasks that offload two given CUDA kernel functions to the device using the pull tasks created in Listing 3. The first kernel task operates on kernel1 with data from pull1. The second kernel task operates on kernel2 with data from pull2. Both tasks configure 256 CUDA threads per block. Kernel functions are not obligated to take any Heteroflow-specific objects. This largely increases the portability and testability of Heteroflow, especially for applications that heavily use third-party kernel functions written by domain experts.

__global__ void kernel1(int* data, int N);
__global__ void kernel2(float* data, int N);
auto k1 = hf.kernel(kernel1, pull1, 100)
.grid_x(N/256)
.block_x(256);
auto k2 = hf.kernel(kernel2, pull2, 10)
.grid(N/256, 1, 1)
.block(256, 1, 1);
Listing 7: Creates two kernel tasks that operate on the two pull tasks in Listing 3.
1template <typename F, typename... ArgsT>
2auto KernelTask::kernel(F&& f, ArgsT&&... args) {
3 gather_sources(args...);
4 get_node_handle().work = [
5 k=*this, f=forward<F>(f),
6 t=StatefulTuple(forward<ArgsT>(args)...)
7 ] (cudaStream_t s) mutable {
8 k.apply_kernel(s, f, t);
9 };
10 return *this;
11}
12
13template <typename... T>
14auto KernelTask::gather_sources(T&&... tasks) {
15 ([&] { if constexpr(is_pull_task<T>) {
16 get_node_handle().add_sources(tasks);
17 } }(), ...);
18}
19
20template<typename F, typename T>
21auto KernelTask::apply_kernel(
22 cudaStream_t s, F f, T t
23) {
24 const auto N = tuple_size<T>::value;
25 apply_kernel(s, f, t, make_index_sequence<N>{});
26}
27
28template<typename F, typename T, size_t... I>
29auto KernelTask::apply_kernel(
30 cudaStream_t s, F f, T t, index_sequence<I...>
31) {
32 auto& h = get_node_handle();
33 f<<<h.grid, h.block, h.shm, s>>>(
34 convert(get<I>(t))...
35 );
36}
Listing 8: Implementation details of the kernel task.

Listing 8 highlights the core implementation of the kernel task. The kernel method takes a kernel function written in CUDA and the remaining arguments used to invoke the kernel (line 1:2). The arity must match on both sides. A key difference between Heteroflow and existing models is the way we establish data connections – we use pull tasks as the gateway rather than raw pointers. This abstraction largely improves safety and transparency in scaling graph execution to multiple GPUs. From the input argument list, we gather all relevant pull tasks of this kernel (line 3 and line 13:18) and let the scheduler perform automatic device placement. Similar to push and pull tasks, we capture the argument list in a stateful tuple (line 6) and use two auxiliary functions to invoke the kernel from the tuple (line 20:36). All runtime changes to the arguments will be reflected in the execution context of the kernel.

1struct PointerCaster {
2 void* data {nullptr};
3 template <typename T>
4 operator T* () {
5 return (T*)data;
6 }
7};
8
9template <typename T>
10auto KernelTask::convert(T&& arg) {
11 if constexpr(is_pull_task<T>) {
12 return PointerCaster{arg.data()};
13 }
14 else {
15 return forward<T>(arg);
16 }
17}
Listing 9: Implementation details of the data connection between a pull task and a kernel task.

Each argument in the kernel function must undergo another conversion (line 34 in Listing 8) before launching the kernel. The purpose of this conversion is to transform the pull task to the type of the corresponding kernel argument, and to possibly conduct any sanity checks at both compile time and runtime. Listing 9 highlights the core implementation of this conversion. The function convert evaluates an argument at compile time (line 9:17). If the argument is a pull task, it returns a cast of the internal GPU data pointer to the target argument type (line 11:13). Otherwise, it forwards the argument in return (line 15). The auxiliary structure PointerCaster (line 1:7) is designed to operate on plain old data (POD) pointers in support of conventional GPU kernel programming syntaxes. The same concept applies to custom data types, depending on a compiler's capability.
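As a brief standalone usage sketch (not Heteroflow's internal code; the kernel and helper names here are hypothetical), the implicit conversion operator is what lets a type-erased device pointer bind to whatever pointer type a kernel parameter declares:

#include <cuda_runtime.h>

struct PointerCaster {
  void* data {nullptr};
  template <typename T>
  operator T* () { return static_cast<T*>(data); }
};

__global__ void scale(int* v, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) v[i] *= 2;
}

void launch(void* gpu_ptr, int n, cudaStream_t s) {
  // T is deduced as int from the kernel parameter type at the call site.
  scale<<<(n + 255) / 256, 256, 0, s>>>(PointerCaster{gpu_ptr}, n);
}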

III-A5 Add a Dependency Link

After tasks are created, the next step is to add dependency links. A dependency link is a directed edge between two tasks to force one task to run before or after another. Heteroflow defines two very intuitive methods, precede and succeed, to let users create task dependencies. The two methods are symmetrical to each other. A preceding link forces a task to run before another and a succeeding link forces a task to run after another. Heteroflow’s task interface is uniform. Users can insert dependencies between tasks of different types as long as no cycles are formed.
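For example, assuming two previously created task handles a and b, the following two statements are interchangeable and encode the same edge from a to b:

a.precede(b);   // a must finish before b starts
b.succeed(a);   // equivalent: b runs after a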


Figure 3: A task graph of eight tasks and seven dependency constraints.
__global__ void k1(int* vec1);
__global__ void k2(int* vec1, int* vec2);
vector<int> vec1, vec2;
hf::Heteroflow hf;
auto host1 = hf.host([](){ vec1.resize(100, 0); });
auto host2 = hf.host([](){ vec2.resize(100, 1); });
auto pull1 = hf.pull(vec1);
auto pull2 = hf.pull(vec2);
auto push1 = hf.push(pull1, vec1);
auto push2 = hf.push(pull2, vec2);
auto kernel1 = hf.kernel(k1, pull1);
auto kernel2 = hf.kernel(k2, pull1, pull2);
host1.precede(pull1);
host2.precede(pull2);
pull1.precede(kernel1);
pull2.precede(kernel2);
kernel1.precede(push1, kernel2);
kernel2.precede(push2);
Listing 10: Creates dependency links to describe Figure 3.

Listing 10 gives an example of using the method precede to describe the dependency graph in Figure 3. Users can precede an arbitrary number of tasks in one call. The overall code to create dependency links in Heteroflow is very simple, concise, and self-explanatory. An important takeaway here is that task dependency is explicit in Heteroflow. Our API never creates implicit dependency links even though they are obvious in certain graphs. Such concern typically arises when creating a kernel task that requires GPU data from other pull tasks. In this scenario, pull tasks must finish before the kernel task and users are responsible for this dependency in their graphs. Heteroflow delegates the dependency controls to users so they can tailor graphs to their needs. With careful graph construction and refinement, applications can efficiently reuse data without adding redundant task dependencies. For example, kernel2 in Figure 3 can access the GPU data of pull1 as a result of transitive dependency (pull1 precedes kernel1 and kernel1 precedes kernel2). Listing 10 implements this intent.

III-A6 Inspect a Task Dependency Graph

Another powerful feature of Heteroflow on the user front is the visualization of a task dependency graph using the standard DOT format. Users can find readily available tools such as Python Graphviz and viz.js to draw a graph without extra programming effort. Graph visualization largely facilitates testing and debugging of Heteroflow applications. Listing 11 gives an example of dumping a Heteroflow graph to the standard output.

hf.dump(cout);
cout << hf.dump();
Listing 11: Dumps a Heteroflow graph to the standard output.

III-B Execute a Task Dependency Graph

An executor is the basic building block for executing a Heteroflow graph. It manages a set of CPU threads and GPU devices and schedules tasks for execution. When a task is ready, the runtime submits the task to an execution context, which can be either a physical CPU core or a GPU device. In Heteroflow, a task is indeed a callable. When users create a task, Heteroflow marshals all required parameters along with unique placeholders for runtime arguments to form a closure that can be run by any CPU thread. Execution of a GPU task is placed under a GPU context. The scheduler manages all such details to ensure consistent results across multiple GPUs. Listing 12 creates an executor of eight CPU threads and four GPUs and uses it to execute a graph once, 100 times, and repeatedly until a stopping criterion is met. Users can adjust these numbers based on hardware capability to easily scale their graphs across different CPU-GPU configurations. All the run methods in the executor class are non-blocking. Issuing a run on a graph returns immediately with a C++ future object. Users can use it to inspect the execution status of the graph or chain up a continuation for asynchronous controls. The executor class also provides a method wait_for_all that blocks until all running graphs associated with the caller executor finish. Heteroflow’s executor interface is thread-safe. Touching an executor from multiple threads is valid. Users can take advantage of this property to explore higher-level parallelism without concern for races in execution.

hf::Executor executor(8, 4); // 8 CPU threads 4 GPUs
hf::Heteroflow graph;
auto future1 = executor.run(graph);
auto future2 = executor.run_n(graph, 100);
auto future3 = executor.run_until(graph, [&] () {
return custom_stopping_criteria();
});
executor.wait_for_all();
Listing 12: Creates an executor to run a Heteroflow graph.
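As a brief sketch of the asynchronous control described above (assuming the returned object behaves like a std::future), a caller can poll or block on a particular run:

auto fu = executor.run(graph);                 // non-blocking submission
// ... perform other CPU work while the graph executes ...
if (fu.wait_for(std::chrono::seconds(0)) == std::future_status::ready) {
  // this particular run has already finished
}
fu.get();                                      // block until this run completes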

III-C Scheduling Algorithm

Another major contribution of Heteroflow is the design of a scheduler on top of our heterogeneous tasking interface. The scheduler is an integral part of the executor for mapping task graphs onto available CPU cores and GPUs. When an executor is created with N CPU threads and M GPUs, we spawn N CPU threads, namely workers, to execute tasks. Unlike existing works [StarPU, XKAAPI++], we do not dedicate a worker to manage a target GPU, since all tasks are uniformly represented in Heteroflow using polymorphic function objects (see Listings 4, 6, and 8). This largely facilitates the design of our scheduler in providing efficient resource utilization and flexible runtime optimizations, for instance, GPU memory allocators, asynchronous CUDA streams, and task fusing.

Our scheduler design is motivated by [Huang_19_01]. When a graph is submitted to an executor, a special data structure called topology is created to marshal execution parameters and runtime metadata. Each heteroflow object has a list of topologies to track individual execution status. The executor also maintains a topology counter to signal callers on completion. The communication is based on a shared state managed by a pair of C++ promise and future objects. The first step in scheduling is device placement, mapping each GPU task to a particular GPU device. An advantage of our programming model is the implicit data dependency between a kernel and its pull tasks (see line 3 in Listing 8), which the scheduler can utilize to place them on the right device. Based on this property, we develop a simple and efficient device placement algorithm using union-find and bin packing, as shown in Algorithm 1. The key idea is to group each kernel with its source pull tasks (line 1:7) and then pack each unique group to a GPU bin with an optimized cost (line 8:14). By default, we minimize the load per GPU bin for maximal concurrency, but this strategy can be exposed through a pluggable interface for custom cost metrics.

1 foreach t ∈ tasks do
2   if t.type() == KERNEL then
3     foreach p ∈ t.source_pull_tasks() do
4       set_union(t, p);
5     end foreach
6   end if
7 end foreach
8 foreach t ∈ tasks do
9   if x ← t.type(); x == KERNEL or x == PULL then
10     if r ← set_find(t); is_set_root(r) then
11       set_bin_packing_with_balanced_load(t);
12     end if
13   end if
14 end foreach
Algorithm 1 DevicePlacement
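To make Algorithm 1 concrete, the following is a compact, self-contained C++ sketch of the idea (our own simplification, not the executor's actual code): a union-find structure groups each kernel with its source pull tasks, and each group root is then packed onto the currently least-loaded GPU bin.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

struct DevicePlacer {
  std::vector<int> parent;              // union-find parent per task id
  std::vector<std::size_t> load;        // accumulated load per GPU bin

  DevicePlacer(std::size_t num_tasks, std::size_t num_gpus)
    : parent(num_tasks), load(num_gpus, 0) {
    std::iota(parent.begin(), parent.end(), 0);
  }

  int find(int x) {                     // path compression
    return parent[x] == x ? x : parent[x] = find(parent[x]);
  }

  void unite(int kernel, int pull) {    // group a kernel with a source pull task
    parent[find(pull)] = find(kernel);
  }

  // Pack one group (identified by its root) onto the least-loaded GPU.
  int place(int root, std::size_t cost) {
    auto it = std::min_element(load.begin(), load.end());
    *it += cost;
    return static_cast<int>(it - load.begin());
  }
};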

After device placement, the scheduler enters a work-stealing loop where each worker thread iteratively drains out tasks from its local queue and transitions to a thief to steal a task from a randomly selected peer called victim. The process stops when an executor is destroyed. We employ work-stealing because it has been extensively studied and used in many parallel processing systems for dynamic load-balancing and irregular computations [Lima_15_01, Lin_20_01]. When a worker thread executes a task, it applies a visitor pattern that invokes a separate method for each task type. Running a host task is trivial, but calling a GPU task must be scoped under the right execution context. Heteroflow provides a resource acquisition is initialization (RAII)-style mechanism on top of CUDA device API to scope the task execution under its assigned GPU device. Listing 13 gives the implementation details of invoking a pull task from an executor. All GPU tasks are synchronized through CUDA events (line 4 and line 6).

1void Executor::invoke(unsigned me, Pull& h) {
2 auto [d, s, e] = get_device_stream_event(me, h);
3 ScopedDeviceContext ctx(d);
4 cudaEventRecord(e, s);
5 h.work(get_device_allocator(d), s);
6 cudaStreamWaitEvent(s, e, 0);
7}
Listing 13: Implementation details of invoking a pull task.
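A minimal sketch of the RAII device scope used in Listing 13 (the class name comes from the listing; its exact implementation in Heteroflow may differ) switches to the assigned GPU on construction and restores the previous device on destruction:

#include <cuda_runtime.h>

class ScopedDeviceContext {
  int _prev {0};
 public:
  explicit ScopedDeviceContext(int device) {
    cudaGetDevice(&_prev);    // remember the caller's current device
    cudaSetDevice(device);    // switch to the assigned GPU
  }
  ~ScopedDeviceContext() {
    cudaSetDevice(_prev);     // restore on scope exit
  }
};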

While detailing the scheduler design is beyond the scope of this paper, there are a few notable items. First, each worker keeps a per-thread CUDA stream to enable concurrent GPU memory and kernel operations. Second, our executor keeps a memory pool for each GPU device to reduce the scheduling overhead of frequent allocations by pull tasks. We implement the well-known buddy allocation algorithm [Buddy]. Third, our work-stealing loop adopts an adaptive strategy to balance working and sleeping threads on top of available task parallelism. The key idea is to ensure one thief exists as long as an active worker is running a task. At the time of this writing, our scheduler design might not be perfect, but it provides a proof of concept for our programming model and fosters future research opportunities for new algorithms.
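As an illustration of the per-GPU memory pool, the following is a simplified sketch of the buddy idea (the library's allocator is more elaborate and, unlike this sketch, coalesces freed buddies): block sizes are rounded to powers of two and served from per-order free lists.

#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

class BuddyPool {
  static constexpr int kMinOrder = 8;    // 256-byte minimum block
  static constexpr int kMaxOrder = 28;   // 256-MB maximum block
  std::vector<std::vector<void*>> _free =
    std::vector<std::vector<void*>>(kMaxOrder + 1);

  static int order_of(std::size_t bytes) {
    int o = kMinOrder;
    while ((std::size_t{1} << o) < bytes) ++o;
    return o;
  }

 public:
  void* allocate(std::size_t bytes) {
    int o = order_of(bytes);
    int cur = o;
    while (cur <= kMaxOrder && _free[cur].empty()) ++cur;
    if (cur > kMaxOrder) {               // pool exhausted: back with cudaMalloc
      void* p = nullptr;
      cudaMalloc(&p, std::size_t{1} << o);
      return p;
    }
    void* block = _free[cur].back();
    _free[cur].pop_back();
    while (cur > o) {                    // split down to the requested order
      --cur;
      _free[cur].push_back(static_cast<char*>(block) + (std::size_t{1} << cur));
    }
    return block;
  }

  void deallocate(void* p, std::size_t bytes) {
    _free[order_of(bytes)].push_back(p); // no coalescing in this sketch
  }
};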

IV Experimental Results

We evaluated the performance of Heteroflow on two real VLSI CAD applications, timing analysis and standard cell placement. Each application represents a unique computation pattern. All experiments ran on a Ubuntu Linux 5.0.0-21-generic x86 64-bit machine with 40 Intel Xeon Gold 6138 CPU cores at 2.00 GHz, 4 GeForce RTX 2080 GPUs, and 256 GB RAM. The timing analysis program is compiled with g++ 8.2 and nvcc (CUDA 10.1) under the C++14 standard (-std=c++14) and optimization flag -O2. The placement program is compiled under the same environment. Both programs are derived from our open-source projects, OpenTimer [Huang_15_01, Huang_21_01, Huang_21_02] and DREAMPlace [DREAMPlace], which consist of complex domain-specific algorithms with more than 10K lines of code over years of development.

IV-A VLSI Timing Analysis

We applied Heteroflow to solve a VLSI timing analysis problem. Timing analysis is a very important component in the overall design flow (see Figure 2). It verifies the expected timing behaviors of a digital circuit to ensure correct functionality after tape-out. Among various timing analysis problems, one subject is to find the correlation between different timing views. Each view represents a unique combination of a process variation corner (e.g., temperature, voltage) and an analysis mode (e.g., testing, functional). Figure 4 shows that the number of required analysis views increases exponentially as the technology node advances [Huang_15_01, Huang_21_01]. Timing correlation is not only important for reasoning about the behavior of a timer but also useful for building regression models to reduce required analysis iterations.


Figure 4: The required analysis views in terms of corners and modes increase exponentially as the technology node advances.

In reality, there are many ways to conduct timing analysis and correlation. In this experiment, we consider a representative three-step flow: a timer generates analysis datasets from a circuit design across multiple views; a hybrid CPU-GPU algorithm extracts timing statistics and generates regression models for each dataset; a synchronization step combines all assessed quantities into a concrete report. Figure LABEL:fig::exp-timing-task-graph illustrates a fractional task graph of two views. We use the open-source tool, OpenTimer, to generate 1024 different timing reports for a large circuit, netcard, of 1.5M gates [Huang_15_01, Huang_21_01]. The correlation layer implements a CPU-based algorithm to e