
Multiple-GPU accelerated high-order gas-kinetic scheme for direct numerical simulation of compressible turbulence

Yuhang Wang [email protected] Guiyu Cao [email protected] Liang Pan [email protected] Laboratory of Mathematics and Complex Systems, School of Mathematical Sciences, Beijing Normal University, Beijing, China Academy for Advanced Interdisciplinary Studies, Southern University of Science and Technology, Shenzhen, China
Abstract

High-order gas-kinetic scheme (HGKS) has become a workable tool for the direct numerical simulation (DNS) of turbulence. In this paper, to accelerate the computation, HGKS is implemented on graphics processing units (GPUs) using the compute unified device architecture (CUDA). Because the available memory of a single GPU limits the computational scale, HGKS is further upgraded to multiple GPUs using the message passing interface (MPI) together with CUDA, so that much larger-scale DNS of turbulence can be conducted. Benchmark cases for compressible turbulence, including the Taylor-Green vortex and turbulent channel flows, are presented to assess the numerical performance of HGKS with Nvidia TITAN RTX and Tesla V100 GPUs. For single-GPU computation, compared with the parallel central processing unit (CPU) code running on the Intel Core i7-9700 with open multi-processing (OpenMP) directives, a 7x speedup is achieved by TITAN RTX and a 16x speedup by Tesla V100. For multiple-GPU computation, the multiple-GPU accelerated HGKS code scales properly with the increasing number of GPUs. The computational time of the parallel CPU code running on 1024 Intel Xeon E5-2692 cores with MPI is approximately 3 times longer than that of the GPU code using 8 Tesla V100 GPUs with MPI and CUDA. The numerical results confirm the excellent performance of multiple-GPU accelerated HGKS for large-scale DNS of turbulence. To reduce the loading and writing pressure on GPU memory, HGKS on GPUs is also compiled with FP32 precision to evaluate the effect of the number-format precision. As expected, compared with the computation in FP64 precision, the efficiency is improved and the memory cost is reduced with FP32 precision, while differences in the accuracy of statistical turbulent quantities appear. For turbulent channel flows, the difference in long-time statistical turbulent quantities between the FP32 and FP64 solutions is acceptable, whereas an obvious discrepancy in instantaneous turbulent quantities is observed, which indicates that FP32 precision is not safe for DNS of compressible turbulence. The choice of precision should depend on the requirement of accuracy and the available computational resources.

keywords:
High-order gas-kinetic scheme, direct numerical simulation, compressible turbulence, multiple-GPU accelerated computation.

1 Introduction

Turbulence is ubiquitous in natural phenomena and engineering applications [1]. The understanding and prediction of multiscale turbulent flow is one of the most difficult problems for both mathematics and the physical sciences. Direct numerical simulation (DNS) solves the Navier-Stokes equations directly, resolves all scales of the turbulent motion, and eliminates modeling entirely [2]. With the advances in high-order numerical methods and supercomputers, great success has been achieved for incompressible and compressible turbulent flows, such as DNS of incompressible isotropic turbulence [3], incompressible turbulent channel flow [4, 5], supersonic isotropic turbulence [6, 7], compressible turbulent channel flows [8, 9, 10], and compressible flat-plate turbulence from the supersonic to the hypersonic regime [11, 12].

In the past decades, the gas-kinetic scheme (GKS) has been developed systematically based on the Bhatnagar-Gross-Krook (BGK) model [13, 14] under the finite volume framework, and applied successfully in computations from low-speed to hypersonic flows [15, 16]. The gas-kinetic scheme presents a gas evolution process from the kinetic scale to the hydrodynamic scale, where both inviscid and viscous fluxes are recovered from a time-dependent and genuinely multi-dimensional gas distribution function at a cell interface. Starting from the time-dependent flux function and based on the two-stage fourth-order formulation [17, 18], the high-order gas-kinetic scheme (HGKS) has been constructed and applied to compressible flow simulations [7, 19, 20, 21]. The high-order accuracy is achieved with the implementation of the traditional second-order GKS flux solver. More importantly, the high-order GKS is as robust as the second-order scheme and works perfectly from subsonic to hypersonic viscous heat-conducting flows. Originally, the parallel HGKS code was developed for central processing units (CPUs) using open multi-processing (OpenMP) directives. However, due to the limited shared memory, the computational scale is constrained for the numerical simulation of turbulence. To perform large-scale DNS, domain decomposition and the message passing interface (MPI) [22] are used for the parallel implementation [23]. Due to the explicit formulation of HGKS, the CPU code with MPI scales properly with the number of processors. The numerical results demonstrate the capability of HGKS as a powerful DNS tool from low-speed to supersonic turbulence [23].

Graphics processing unit (GPU) is a form of hardware acceleration, originally developed for graphics manipulation, which is extremely efficient at processing large amounts of data in parallel. Since these units are inherently capable of parallel computation, they can provide fast and low-cost solutions for high performance computing (HPC). In recent years, GPUs have gained significant popularity in computational fluid dynamics [24, 25, 26] as a cheaper, more efficient, and more accessible alternative to large-scale HPC systems with CPUs. In particular, to study the turbulence physics at much higher Reynolds numbers, extreme-scale DNS of turbulent flows has been implemented using multiple GPUs, i.e., incompressible isotropic turbulence up to 18432^{3} resolution [27] and turbulent pipe flow up to Re_{\tau}\approx 6000 [28]. Recently, the three-dimensional discontinuous Galerkin based HGKS has been implemented on a single GPU using the compute unified device architecture (CUDA) [29]. The results are compared with those obtained by the Intel i7-9700 CPU using OpenMP directives. The GPU code achieves a 6x-7x speedup with TITAN RTX, and a 10x-11x speedup with Tesla V100. The numerical results confirm the potential of HGKS for large-scale DNS of turbulence.

A major limitation of single-GPU computation is its available memory, which constrains the maximum number of computational cells. In this paper, to implement much larger-scale DNS and to further accelerate the computation, HGKS is implemented with multiple GPUs using CUDA and MPI (MPI + CUDA). Multiple-GPU programming is not straightforward, since the memory is not shared across GPUs and their tasks need to be coordinated appropriately. The multiple GPUs are distributed across multiple CPUs at the cost of having to coordinate GPU-GPU communication via MPI. For the WENO-based HGKS on a single GPU using CUDA, compared with the CPU code using the Intel Core i7-9700, a 7x speedup is achieved with TITAN RTX and a 16x speedup with Tesla V100. For the computation with multiple GPUs, the HGKS code scales properly with the increasing number of GPUs. The numerical performance shows that the GPU-GPU data communication through MPI takes relatively little time, while the flow-field computation dominates the cost of the HGKS code. For the MPI parallel computation, the computational time of the CPU code on a supercomputer using 1024 Intel Xeon E5-2692 cores is approximately 3 times longer than that of the GPU code using 8 Tesla V100 GPUs. It can be inferred that the efficiency of the GPU code with 8 Tesla V100 GPUs approximately equals that of the MPI code using 3000 CPU cores on a supercomputer. To reduce the loading and writing pressure on GPU memory, benefits can be achieved by using FP32 (single) precision compared with FP64 (double) precision for memory-intensive computing [30, 31]. Hence, HGKS on GPUs is compiled with both FP32 and FP64 precision to evaluate the effect of precision on DNS of compressible turbulence. As expected, the efficiency can be improved and the memory cost can be reduced with FP32 precision. However, differences in accuracy between FP32 and FP64 precision appear in the instantaneous statistical quantities of turbulent channel flows, which are not negligible for long-time simulations.

This paper is organized as follows. In Section 2, the high-order gas-kinetic scheme is briefly reviewed. The GPU architecture and code design are introduced in Section 3. Section 4 includes numerical simulation and discussions. The last section is the conclusion.

2 High-order gas-kinetic scheme

The three-dimensional BGK equation [13, 14] can be written as

f_{t}+uf_{x}+vf_{y}+wf_{z}=\frac{g-f}{\tau}, \qquad (1)

where \bm{u}=(u,v,w)^{T} is the particle velocity, f is the gas distribution function, g is the three-dimensional Maxwellian distribution and \tau is the collision time. The collision term satisfies the compatibility condition

\int\frac{g-f}{\tau}\psi\text{d}\Xi=0, \qquad (2)

where \psi=(1,u,v,w,\frac{1}{2}(u^{2}+v^{2}+w^{2}+\xi^{2}))^{T}, \xi^{2}=\xi_{1}^{2}+\cdots+\xi_{N}^{2}, \text{d}\Xi=\text{d}u\text{d}v\text{d}w\text{d}\xi_{1}\cdots\text{d}\xi_{N}, N=(5-3\gamma)/(\gamma-1) is the internal degree of freedom, \gamma is the specific heat ratio, and \gamma=1.4 is used in the computation.
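For completeness, the three-dimensional Maxwellian distribution g referred to above takes the standard form used in the gas-kinetic scheme literature [15, 16],

g=\rho\big(\frac{\lambda}{\pi}\big)^{\frac{N+3}{2}}e^{-\lambda\left((u-U)^{2}+(v-V)^{2}+(w-W)^{2}+\xi^{2}\right)},

where \rho is the density, (U,V,W) is the macroscopic velocity, and \lambda=\rho/(2p) with pressure p.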

In this section, the finite volume scheme on an orthogonal structured mesh is presented as an example. Taking moments of the BGK equation Eq.(1) and integrating over the cell \Omega_{ijk}, the finite volume scheme can be expressed as

\frac{\text{d}Q_{ijk}}{\text{d}t}=\mathcal{L}(Q_{ijk}), \qquad (3)

where Q_{ijk} is the cell-averaged conservative variable over \Omega_{ijk}, and the operator \mathcal{L} reads

\mathcal{L}(Q_{ijk})=-\frac{1}{|\Omega_{ijk}|}\sum_{p=1}^{6}\mathbb{F}_{p}(t), \qquad (4)

where \Omega_{ijk} is defined as \Omega_{ijk}=\overline{x}_{i}\times\overline{y}_{j}\times\overline{z}_{k} with \overline{x}_{i}=[x_{i}-\Delta x/2,x_{i}+\Delta x/2], \overline{y}_{j}=[y_{j}-\Delta y/2,y_{j}+\Delta y/2], \overline{z}_{k}=[z_{k}-\Delta z/2,z_{k}+\Delta z/2], and \mathbb{F}_{p}(t) is the numerical flux across the cell interface \Sigma_{p}. The numerical flux in the x-direction is given as an example

\mathbb{F}_{p}(t)=\iint_{\Sigma_{p}}F(Q)\cdot\bm{n}\text{d}\sigma=\sum_{m,n=1}^{2}\omega_{mn}\int\psi uf(\bm{x}_{i+1/2,j_{m},k_{n}},t,\bm{u},\xi)\text{d}\Xi\Delta y\Delta z,

where \bm{n} is the outer normal direction. The Gaussian quadrature is used over the cell interface, where \omega_{mn} is the quadrature weight, \bm{x}_{i+1/2,j_{m},k_{n}}=(x_{i+1/2},y_{j_{m}},z_{k_{n}}) and (y_{j_{m}},z_{k_{n}}) is the Gauss quadrature point of the cell interface \overline{y}_{j}\times\overline{z}_{k}. Based on the integral solution of the BGK equation Eq.(1), the gas distribution function f(\bm{x}_{i+1/2,j_{m},k_{n}},t,\bm{u},\xi) in the local coordinate can be given by

f(\bm{x}_{i+1/2,j_{m},k_{n}},t,\bm{u},\xi)=\frac{1}{\tau}\int_{0}^{t}g(\bm{x}^{\prime},t^{\prime},\bm{u},\xi)e^{-(t-t^{\prime})/\tau}\text{d}t^{\prime}+e^{-t/\tau}f_{0}(-\bm{u}t,\xi), \qquad (5)

where \bm{x}^{\prime}=\bm{x}_{i+1/2,j_{m},k_{n}}-\bm{u}(t-t^{\prime}) is the trajectory of particles, f_{0} is the initial gas distribution function, and g is the corresponding equilibrium state. With the first-order spatial derivatives, the second-order gas distribution function at the cell interface can be expressed as

f(\bm{x}_{i+1/2,j_{m},k_{n}},t,\bm{u},\xi)=(1-e^{-t/\tau})g_{0}+((t+\tau)e^{-t/\tau}-\tau)(\overline{a}_{1}u+\overline{a}_{2}v+\overline{a}_{3}w)g_{0}
+(t-\tau+\tau e^{-t/\tau})\overline{A}g_{0}
+e^{-t/\tau}g_{r}[1-(\tau+t)(a_{1}^{r}u+a_{2}^{r}v+a_{3}^{r}w)-\tau A^{r}](1-H(u))
+e^{-t/\tau}g_{l}[1-(\tau+t)(a_{1}^{l}u+a_{2}^{l}v+a_{3}^{l}w)-\tau A^{l}]H(u), \qquad (6)

where the equilibrium state g_{0} and the corresponding conservative variables Q_{0} can be determined by the compatibility condition

\int\psi g_{0}\text{d}\Xi=Q_{0}=\int_{u>0}\psi g_{l}\text{d}\Xi+\int_{u<0}\psi g_{r}\text{d}\Xi.

The numerical tests in this paper involve compressible turbulent flows without discontinuities, and thus the collision time is taken as

\tau=\frac{\mu}{p},

where \mu is the viscous coefficient and p is the pressure at the cell interface determined by Q_{0}. With the reconstruction of macroscopic variables, the coefficients in Eq.(6) can be fully determined by the reconstructed derivatives and the compatibility condition

\langle a_{1}^{k}\rangle=\frac{\partial Q_{k}}{\partial x},\quad \langle a_{2}^{k}\rangle=\frac{\partial Q_{k}}{\partial y},\quad \langle a_{3}^{k}\rangle=\frac{\partial Q_{k}}{\partial z},\quad \langle a_{1}^{k}u+a_{2}^{k}v+a_{3}^{k}w+A^{k}\rangle=0,
\langle\overline{a}_{1}\rangle=\frac{\partial Q_{0}}{\partial x},\quad \langle\overline{a}_{2}\rangle=\frac{\partial Q_{0}}{\partial y},\quad \langle\overline{a}_{3}\rangle=\frac{\partial Q_{0}}{\partial z},\quad \langle\overline{a}_{1}u+\overline{a}_{2}v+\overline{a}_{3}w+\overline{A}\rangle=0,

where k=l,r and \langle...\rangle are the moments of the equilibrium g, defined by

\langle...\rangle=\int g(...)\psi\text{d}\Xi.

More details of the gas-kinetic scheme can be found in the literature [15, 16]. Thus, the numerical flux can be obtained by taking moments of the gas distribution function Eq.(6), and the semi-discretized finite volume scheme Eq.(3) can be fully given.

Recently, based on the time-dependent flux function of the generalized Riemann problem (GRP) solver [17, 18] and the gas-kinetic scheme [19, 20, 21], a two-stage fourth-order time-accurate discretization was developed for Lax-Wendroff type flow solvers. It provides a reliable framework to construct higher-order gas-kinetic schemes, and the resulting high-order scheme is as robust as the second-order one and works perfectly from subsonic to hypersonic flows. In this study, the two-stage method is used for temporal accuracy. Consider the following time-dependent equation

\frac{\partial Q}{\partial t}=\mathcal{L}(Q),

with the initial condition at t_{n}, i.e., Q(t=t_{n})=Q^{n}, where \mathcal{L} is the operator for the spatial derivative of the flux. The state Q^{n+1} at t_{n+1}=t_{n}+\Delta t can be updated with the following formula

\begin{split}
Q^{*}&=Q^{n}+\frac{1}{2}\Delta t\mathcal{L}(Q^{n})+\frac{1}{8}\Delta t^{2}\partial_{t}\mathcal{L}(Q^{n}),\\
Q^{n+1}&=Q^{n}+\Delta t\mathcal{L}(Q^{n})+\frac{1}{6}\Delta t^{2}\big(\partial_{t}\mathcal{L}(Q^{n})+2\partial_{t}\mathcal{L}(Q^{*})\big).
\end{split} \qquad (7)

It can be proved that, for hyperbolic equations, the above temporal discretization Eq.(7) provides a fourth-order time-accurate solution for Q^{n+1}; a quick check for the linear case is sketched below.
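For the linear autonomous case \mathcal{L}(Q)=AQ with a constant operator A (so that \partial_{t}\mathcal{L}(Q)=A^{2}Q), substituting into Eq.(7) gives

Q^{*}=\big(I+\frac{\Delta t}{2}A+\frac{\Delta t^{2}}{8}A^{2}\big)Q^{n},\qquad Q^{n+1}=Q^{n}+\Delta tAQ^{n}+\frac{\Delta t^{2}}{6}\big(A^{2}Q^{n}+2A^{2}Q^{*}\big)=\big(I+\Delta tA+\frac{\Delta t^{2}}{2}A^{2}+\frac{\Delta t^{3}}{6}A^{3}+\frac{\Delta t^{4}}{24}A^{4}\big)Q^{n},

which reproduces the Taylor expansion of the exact solution e^{\Delta tA}Q^{n} up to O(\Delta t^{4}); the proof for general hyperbolic systems is given in [17].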

To implement the two-stage fourth-order method for Eq.(3), a linear function is used to approximate the time-dependent numerical flux

\mathbb{F}_{p}(t)\approx\mathbb{F}_{p}^{n}+\partial_{t}\mathbb{F}_{p}^{n}(t-t_{n}). \qquad (8)

Integrating Eq.(8) over [t_{n},t_{n}+\Delta t/2] and [t_{n},t_{n}+\Delta t], we have the following two equations

\mathbb{F}_{p}^{n}\Delta t+\frac{1}{2}\partial_{t}\mathbb{F}_{p}^{n}\Delta t^{2}=\int_{t_{n}}^{t_{n}+\Delta t}\mathbb{F}_{p}(t)\text{d}t,
\frac{1}{2}\mathbb{F}_{p}^{n}\Delta t+\frac{1}{8}\partial_{t}\mathbb{F}_{p}^{n}\Delta t^{2}=\int_{t_{n}}^{t_{n}+\Delta t/2}\mathbb{F}_{p}(t)\text{d}t.
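Solving this 2\times 2 linear system gives the two flux coefficients explicitly (a standard step of the two-stage gas-kinetic scheme [19, 20]),

\mathbb{F}_{p}^{n}=\frac{1}{\Delta t}\Big(4\int_{t_{n}}^{t_{n}+\Delta t/2}\mathbb{F}_{p}(t)\text{d}t-\int_{t_{n}}^{t_{n}+\Delta t}\mathbb{F}_{p}(t)\text{d}t\Big),\qquad \partial_{t}\mathbb{F}_{p}^{n}=\frac{4}{\Delta t^{2}}\Big(\int_{t_{n}}^{t_{n}+\Delta t}\mathbb{F}_{p}(t)\text{d}t-2\int_{t_{n}}^{t_{n}+\Delta t/2}\mathbb{F}_{p}(t)\text{d}t\Big),

where both time integrals are evaluated analytically from the time-dependent gas distribution function Eq.(6).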

With \mathbb{F}_{p}^{n} and \partial_{t}\mathbb{F}_{p}^{n} determined at the initial stage, \mathcal{L}(Q_{i}^{n}) and the temporal derivative \partial_{t}\mathcal{L}(Q_{i}^{n}) at t^{n} can be constructed according to Eq.(4) as

\mathcal{L}(Q_{i}^{n})=-\frac{1}{|\Omega_{i}|}\sum_{p=1}^{6}\mathbb{F}_{p}^{n},\qquad \partial_{t}\mathcal{L}(Q_{i}^{n})=-\frac{1}{|\Omega_{i}|}\sum_{p=1}^{6}\partial_{t}\mathbb{F}_{p}^{n}.

The flow variables Q^{*} at the intermediate stage can then be updated. Similarly, \mathcal{L}(Q_{i}^{*}) and \partial_{t}\mathcal{L}(Q_{i}^{*}) at the intermediate stage can be constructed, and Q^{n+1} can be updated as well. For high-order spatial accuracy, the fifth-order WENO method [32, 33] is adopted, and the dimension-by-dimension reconstruction is used for the three-dimensional computation [20].

3 HGKS code design on GPU

CPUs and GPUs are equipped with different architectures and are built for different purposes. The CPU is suited to general workloads, especially those for which per-core performance is important; it is designed to execute complex logical tasks quickly, but is limited in the number of threads. Meanwhile, the GPU is a form of hardware acceleration, originally developed for graphics manipulation, which is extremely efficient at processing large amounts of data in parallel. Currently, GPUs have gained significant popularity in high performance scientific computing [24, 25, 26]. In this paper, to accelerate the computation, the WENO-based HGKS is implemented on a single GPU using CUDA. To conduct large-scale DNS of turbulence efficiently, HGKS is further upgraded with multiple GPUs using MPI + CUDA.

3.1 Single-GPU accelerated HGKS

The CPU is regarded as the host, and the GPU is treated as the device. Data-parallel, compute-intensive operations running on the host are transferred to the device by kernels, and kernels are executed on the device by many different threads. In CUDA, these threads are organized into thread blocks, and thread blocks constitute a grid. Such computational structures build the connection with the Nvidia GPU hardware architecture. The Nvidia GPU consists of multiple streaming multiprocessors (SMs), and each SM contains streaming processors (SPs). When invoking a kernel, the blocks of the grid are distributed to SMs, and the threads of each block are executed by SPs. In summary, the correspondence between the CUDA software and the Nvidia GPU hardware is shown in Fig.1. In the following, the single-GPU accelerated HGKS is introduced briefly.

Figure 1: Correspondence between software CUDA and hardware GPU.
   Initialization
  while TIME \leq TSTOP do
     dimGrid = dim3(N_z, N_x/block_x, 1)
     dimBlock = dim3(block_x, N_y, 1)
     STEP 1 : Calculation of time step
     CALL GETTIMESTEP<<<dimGrid,dimBlock>>>
     istat = cudaDeviceSynchronize()
     STEP 2 : WENO reconstruction
     CALL WENO-x<<<dimGrid,dimBlock>>>
     istat = cudaDeviceSynchronize()
     STEP 3 : Computation of flux
     CALL FLUX-x<<<dimGrid,dimBlock>>>
     istat = cudaDeviceSynchronize()
     % the kernels for WENO reconstruction and flux calculation in the x-direction are shown as an example;
     % the reconstruction and flux calculation in the y- and z-directions are implemented in the same way.
     STEP 4 : Update of flow variables
  end while
Algorithm 1 Single-GPU HGKS code using CUDA
Figure 2: Specifications of grid for the three-dimensional structured meshes.

Device implementation is handled automatically by CUDA, while the kernels and grids need to be specified for HGKS. One updating step of the single-GPU accelerated HGKS code is divided into several kernels, namely initialization, calculation of the time step, WENO reconstruction, flux computation at the cell interfaces in the x-, y- and z-directions, and the update of flow variables. The main parts of the single-GPU accelerated HGKS code are labeled in blue in Algorithm 1. Since each cell performs an identical sequence of operations in every kernel, it is natural to assign one thread to one computational cell when setting the grid. Because the number of threads in one block is limited (at most 1024 threads per block), the whole computational domain has to be divided into several parts. Assume that the total number of cells of the computational mesh is N_{x}\times N_{y}\times N_{z}, as illustrated in Fig.2. The computational domain is divided into N_{x}/\text{block}_{x} parts in the x-direction, where block_x for each kernel is an integer chosen according to requirements and experience. If N_{x} is not divisible by block_x, an extra block is needed. When setting the grid, the variables "dimGrid" and "dimBlock" are defined as

{\rm dimGrid}={\rm dim3}(N_{z},N_{x}/\text{block}_{x},1),
{\rm dimBlock}={\rm dim3}(\text{block}_{x},N_{y},1).

As shown in Fig.2, one thread is assigned to complete the computations of a cell \Omega_{ijk}, and the one-to-one correspondence of the thread index threadidx, block index blockidx and cell index (i,j,k) is given by

i=\text{threadidx}\%\text{x}+\text{block}_{x}*(\text{blockidx}\%\text{y}-1),
j=\text{threadidx}\%\text{y},
k=\text{blockidx}\%\text{x}.

Thus, the single-GPU accelerated HGKS code can be implemented after specifying kernels and grids.
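As an illustration of the kernel and grid setting above, a minimal CUDA Fortran sketch of the flow-variable update kernel is given below. The module and array names (w, res) are illustrative assumptions rather than the actual variables of the HGKS code; the index mapping follows the formulas above.

module update_kernel_m
  use cudafor
  implicit none
contains
  attributes(global) subroutine update_w(nx, ny, nz, blockx, dt, w, res)
    ! one thread per cell: dimGrid = dim3(Nz, Nx/block_x, 1), dimBlock = dim3(block_x, Ny, 1)
    integer, value :: nx, ny, nz, blockx
    real(8), value :: dt
    real(8) :: w(5, nx, ny, nz), res(5, nx, ny, nz)
    integer :: i, j, k, m
    i = threadIdx%x + blockx*(blockIdx%y - 1)
    j = threadIdx%y
    k = blockIdx%x
    if (i <= nx .and. j <= ny .and. k <= nz) then
       do m = 1, 5
          ! res holds the operator L(Q), i.e. -1/|Omega| times the sum of interface fluxes of this cell
          w(m, i, j, k) = w(m, i, j, k) + dt*res(m, i, j, k)
       end do
    end if
  end subroutine update_w
end module update_kernel_m

! host-side launch (usage sketch):
!   call update_w<<<dim3(Nz, Nx/block_x, 1), dim3(block_x, Ny, 1)>>>(Nx, Ny, Nz, block_x, dt, w_d, res_d)
!   istat = cudaDeviceSynchronize()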

Figure 3: The domain decomposition and GPU-GPU communications for multiple-GPU implementation.

3.2 Multiple-GPU accelerated HGKS

Due to the limited computational resources of a single GPU, it is natural to develop the multiple-GPU accelerated HGKS for large-scale DNS of turbulence. A larger computational scale can be achieved with the increased available device memory. Programming with multiple GPUs is not straightforward, because the device memories are not shared across GPUs and the tasks need to be coordinated appropriately. The computation with multiple GPUs is implemented using multiple CPUs and multiple GPUs, where the GPU-GPU communication is coordinated via MPI. A CUDA-aware MPI library is chosen [34], so that GPU data can be passed directly to MPI functions and transferred in an efficient way automatically.

  if (i=0 or i=N-1) then
      Boundary conditions for P_0 and P_{N-1}
      % for the wall boundary, the boundary condition can be given directly.
      % for the periodic boundary, MPI_SEND and MPI_RECV are also needed.
  end if
  if (i = even number) then
     MPI_SEND data from P_i to P_{i-1},
     MPI_RECV data from P_{i-1} to P_i.
  else
     MPI_RECV data from P_{i+1} to P_i,
     MPI_SEND data from P_i to P_{i+1}.
  end if
  if (i = odd number) then
     MPI_SEND data from P_i to P_{i-1},
     MPI_RECV data from P_{i-1} to P_i.
  else
     MPI_RECV data from P_{i+1} to P_i,
     MPI_SEND data from P_i to P_{i+1}.
  end if
Algorithm 2 Data communication for MPI+CUDA
Figure 4: Multiple-GPU accelerated HGKS code using MPI+CUDA.

For parallel computation with MPI, domain decomposition is required for both CPU and GPU, where both one-dimensional (slab) and two-dimensional (pencil) decompositions are allowed. For CPU computation with MPI, hundreds or thousands of CPU cores are usually used, and the pencil decomposition can be used to improve the efficiency of data communication [23]. For GPU computation with MPI, the slab decomposition is used, which reduces the frequency of GPU-GPU communications and improves performance. The domain decomposition and GPU-GPU communications are presented in Fig.3, where the computational domain is divided into N parts in the x-direction and N GPUs are used. For simplicity, P_{i} denotes the i-th MPI process, which deals with the tasks of the i-th decomposed computational domain on the i-th GPU. To implement the WENO reconstruction, the process P_{i} exchanges data with P_{i-1} and P_{i+1} using MPI_RECV and MPI_SEND. The wall boundary conditions of P_{0} and P_{N-1} can be implemented directly. For the periodic boundary, data are exchanged between P_{0} and P_{N-1} using MPI_RECV and MPI_SEND as well. The detailed data communication for GPU computation with MPI is summarized in Algorithm 2, and a minimal sketch of the corresponding ghost-cell exchange is given below. Fig.4 shows the code framework of HGKS with multiple GPUs. The initial data are divided into N parts according to the domain decomposition and sent to the corresponding processes P_{1} to P_{N-1} using MPI_SEND. For input and output, the MPI process P_{0} is responsible for distributing data to and collecting data from the other processes, using MPI_SEND and MPI_RECV. For each decomposed domain, the computation is executed on the prescribed GPU. The identical code runs on each GPU, where kernels and grids are set as in the single-GPU code [29].
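The sketch below illustrates, in CUDA Fortran, how the ghost-cell exchange of Algorithm 2 can be written with a CUDA-aware MPI library; the buffer and subroutine names are illustrative assumptions. Device arrays are passed directly to MPI_SEND/MPI_RECV, and the even/odd ordering of Algorithm 2 (which assumes an even number of slabs when the direction is periodic) avoids deadlock with blocking calls.

subroutine exchange_ghost(n, rank, nproc, comm, sendL, sendR, recvL, recvR)
  use mpi
  use cudafor
  implicit none
  integer, intent(in) :: n, rank, nproc, comm
  real(8), device :: sendL(n), sendR(n), recvL(n), recvR(n)   ! packed ghost-cell buffers on the GPU
  integer :: left, right, ierr, istat(MPI_STATUS_SIZE)
  left  = mod(rank - 1 + nproc, nproc)   ! periodic neighbours of the x-direction slabs
  right = mod(rank + 1, nproc)
  if (mod(rank, 2) == 0) then            ! even ranks: exchange with the left neighbour first
     call MPI_SEND(sendL, n, MPI_DOUBLE_PRECISION, left,  0, comm, ierr)
     call MPI_RECV(recvL, n, MPI_DOUBLE_PRECISION, left,  1, comm, istat, ierr)
     call MPI_SEND(sendR, n, MPI_DOUBLE_PRECISION, right, 1, comm, ierr)
     call MPI_RECV(recvR, n, MPI_DOUBLE_PRECISION, right, 0, comm, istat, ierr)
  else                                   ! odd ranks: mirrored order with matching tags
     call MPI_RECV(recvR, n, MPI_DOUBLE_PRECISION, right, 0, comm, istat, ierr)
     call MPI_SEND(sendR, n, MPI_DOUBLE_PRECISION, right, 1, comm, ierr)
     call MPI_RECV(recvL, n, MPI_DOUBLE_PRECISION, left,  1, comm, istat, ierr)
     call MPI_SEND(sendL, n, MPI_DOUBLE_PRECISION, left,  0, comm, ierr)
  end if
end subroutine exchange_ghost

Each process packs the WENO stencil cells of its slab boundaries into sendL/sendR before the call and unpacks recvL/recvR into its ghost cells afterwards; with a CUDA-aware MPI build the transfers move directly between GPU memories (NVLink within a node, RoCE across nodes).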

3.3 Detailed parameters of CPUs and GPUs

In the following studies, the CPU code runs on a desktop with an Intel Core i7-9700 CPU using OpenMP directives and on the TianHe-II supercomputer with MPI and the Intel mpiifort compiler. The HGKS codes on CPU are all compiled with FP64 precision. For the GPU computations, Nvidia TITAN RTX and Nvidia Tesla V100 GPUs are used with Nvidia CUDA and the Nvidia HPC SDK. The detailed parameters of the CPUs and GPUs are given in Table 1 and Table 2, respectively. For the TITAN RTX GPU, the GPU-GPU communication is achieved by connection traversing PCIe, and there are 2 TITAN RTX GPUs in one node. For the Tesla V100 GPU, there are 8 GPUs inside one GPU node, and more nodes are needed for more than 8 GPUs. The GPU-GPU communication within one GPU node is achieved by Nvidia NVLink. The communication across GPU nodes can be achieved by the Remote Direct Memory Access (RDMA) technique, including iWARP, RDMA over Converged Ethernet (RoCE) and InfiniBand. In this paper, RoCE is used for communication across GPU nodes.

As shown in Table 2, Tesla V100 is equipped with more FP64 cores and much stronger FP64 performance than TITAN RTX, while the two GPUs are comparable in FP32 performance. Nvidia TITAN RTX is powered by the Turing architecture, while Nvidia Tesla V100 is built for HPC with the Volta architecture. Additionally, Tesla V100 outweighs TITAN RTX in GPU memory size and memory bandwidth. Hence, much better performance is expected in FP64 simulations with Tesla V100. To validate the performance with FP32 and FP64 precision, numerical examples will be presented.

CPU Clock rate Memory size
Desktop Intel Core i7-9700 3.0 GHz 16 GB/8 cores
TianHe-II Intel Xeon E5-2692 v2 2.2 GHz 64 GB/24 cores
Table 1: The detailed parameters of CPUs.
Nvidia TITAN RTX Nvidia Tesla V100
Clock rate 1.77 GHz 1.53 GHz
Stream multiprocessor 72 80
FP64 core per SM 2 32
FP32 precision performance 16.3 Tflops 15.7 Tflops
FP64 precision performance 509.8 Gflops 7834 Gflops
GPU memory size 24 GB 32 GB
Memory bandwidth 672 GB/s 897 GB/s
Table 2: The detailed parameters of GPUs.

4 Numerical tests and discussions

Benchmarks for compressible turbulent flows, including the Taylor-Green vortex and turbulent channel flows, are presented to validate the performance of the multiple-GPU accelerated HGKS. In this section, we mainly concentrate on the computation with a single GPU and multiple GPUs. The GPU-CPU comparison in terms of computational efficiency is presented first, and the scalability of the parallel multiple-GPU code is also studied. Subsequently, HGKS implemented on GPUs is compiled with both FP32 and FP64 precision to evaluate the effect of precision on DNS of compressible turbulence, and a detailed comparison of the numerical efficiency and the accuracy of statistical turbulent quantities with FP32 and FP64 precision is presented.

4.1 Taylor-Green vortex for GPU-CPU comparison

Taylor-Green vortex (TGV) is a classical problem in fluid dynamics developed to study vortex dynamics, turbulent decay and the energy dissipation process [35, 36, 37]. In this case, detailed efficiency comparisons of HGKS running on the CPU and GPU are given. The flow is computed within a periodic cubic box defined as -\pi L\leq x,y,z\leq\pi L. With a uniform temperature, the initial velocity and pressure are given by

U=V_{0}\sin(\frac{x}{L})\cos(\frac{y}{L})\cos(\frac{z}{L}),
V=-V_{0}\cos(\frac{x}{L})\sin(\frac{y}{L})\cos(\frac{z}{L}),
W=0,
p=p_{0}+\frac{\rho_{0}V_{0}^{2}}{16}\big(\cos(\frac{2x}{L})+\cos(\frac{2y}{L})\big)\big(\cos(\frac{2z}{L})+2\big).

In the computation, L=1, V_{0}=1, \rho_{0}=1, and the Mach number takes Ma=V_{0}/c_{0}=0.1, where c_{0} is the sound speed. The fluid is a perfect gas with \gamma=1.4, Prandtl number Pr=1.0, and Reynolds number Re=1600. The characteristic convective time is t_{c}=L/V_{0}.

Mesh size Time step i7-9700 TITAN RTX S_{gc} Tesla V100 S_{gc}
64^{3} 3.571\times 10^{-3} 0.757 0.126 6.0 0.047 16.1
128^{3} 1.785\times 10^{-3} 11.421 1.496 7.6 0.714 16.0
256^{3} 8.925\times 10^{-4} 182.720 24.256 7.4 11.697 15.4
Table 3: Taylor-Green vortex: the total execution times (hours) and speedups for single GPU versus CPU.

In this section, the GPU codes are all compiled with FP64 precision. To test the performance of the single-GPU code, this case is run with Nvidia TITAN RTX and Nvidia Tesla V100 GPUs using Nvidia CUDA. As a comparison, the CPU code running on the Intel Core i7-9700 is also tested. The execution times with different meshes are shown in Table 3, where the total execution times for CPU and GPUs are given in hours and 20 characteristic convective times are simulated. Due to the limitation of the single-GPU memory size, uniform meshes with 64^{3}, 128^{3} and 256^{3} cells are used. For the CPU computation, 8 cores are utilized with OpenMP parallelization, and the corresponding execution time T_{8,CPU}^{total} is used as the baseline for the following comparisons. The speedup S_{gc} is also given in Table 3, which is defined as

S_{gc}=\frac{T_{8,CPU}^{total}}{T_{1,GPU}^{total}},

where T_{8,CPU}^{total} and T_{1,GPU}^{total} denote the total execution times of 8 CPU cores and 1 GPU, respectively. Compared with the CPU code, a 7x speedup is achieved by a single TITAN RTX GPU and a 16x speedup by a single Tesla V100 GPU. Even though Tesla V100 is approximately 15 times faster than TITAN RTX in peak FP64 compute capability, it only performs about 2 times faster than TITAN RTX. This indicates that the memory bandwidth is the bottleneck for the current memory-intensive simulations, and the detailed techniques for GPU programming still need to be investigated.

Mesh size Time step No. CPUs T_{n,CPU}^{total}
128^{3} 1.785\times 10^{-3} 16 13.3
256^{3} 8.925\times 10^{-4} 256 13.5
512^{3} 4.462\times 10^{-4} 1024 66
Table 4: Taylor-Green vortex: the detailed computational parameters on the TianHe-II supercomputer.
Mesh size No. GPUs T_{n,GPU}^{total} T_{n,GPU}^{flow} T_{n,GPU}^{com}
256^{3} 1 11.697
256^{3} 2 6.201 6.122 0.079
256^{3} 4 3.223 2.997 0.226
256^{3} 8 1.987 1.548 0.439
512^{3} 8 22.252 21.383 0.869
512^{3} 12 19.055 17.250 1.805
512^{3} 16 13.466 11.483 1.983
Table 5: Taylor-Green vortex: the detailed computational parameters with Tesla V100 GPUs.

For a single TITAN RTX with 24 GB memory, the maximum computational scale is 256^{3} cells. To enlarge the computational scale, the multiple-GPU accelerated HGKS is designed. Meanwhile, the parallel CPU code has been run on the TianHe-II supercomputer [23], and the execution times with respect to the number of CPU cores are presented in Table 4 for comparison. For the case with 256^{3} cells, the computational time of a single Tesla V100 GPU is comparable with that of the MPI code on the supercomputer using approximately 300 Intel Xeon E5-2692 cores. For the case with 512^{3} cells, the computational time of the CPU code on the supercomputer using 1024 Intel Xeon E5-2692 cores is approximately 3 times longer than that of the GPU code using 8 Tesla V100 GPUs. It can be inferred that the efficiency of the GPU code with 8 Tesla V100 GPUs approximately equals that of the MPI code with 3000 supercomputer cores, which agrees well with the multiple-GPU accelerated finite difference code for multiphase flows [26]. These comparisons show the excellent performance of the multiple-GPU accelerated HGKS for large-scale turbulence simulation. To further show the performance of the GPU computation, the speedup (scalability) is defined as

S_{n}=\frac{T_{1,GPU}^{total}}{T_{n,GPU}^{total}}. \qquad (9)

Fig.5 shows the log-log plot of S_{n} versus n, where 2, 4 and 8 GPUs are used for the case with 256^{3} cells, and 8, 12 and 16 GPUs are used for the case with 512^{3} cells. The ideal scalability of parallel computation would be S_{n}=n; equivalently, in a log-log plot of T_{n,GPU}^{total} versus n, the ideal scalability would follow a slope of -1. However, such ideal scalability is not achievable due to the communication delay among the computational devices and the idle time of computational nodes associated with load balancing. As expected, the explicit formulation of HGKS scales properly with the increasing number of GPUs. Conceptually, the total amount of computation increases by a factor of 16 when the number of cells doubles in every direction. Taking the communication delay into account, the execution time of the 256^{3} case with 1 GPU approximately equals that of the 512^{3} case with 16 GPUs, which also indicates the good scalability of the GPU code. When the GPU code uses more than 8 GPUs, the communication across GPU nodes with RoCE is required, which accounts for the worse scalability with 12 and 16 GPUs. Table 5 shows the execution time of the flow solver T_{n,GPU}^{flow} and the communication time T_{n,GPU}^{com}. Specifically, T_{n,GPU}^{flow} consists of the time for WENO reconstruction, flux calculation at cell interfaces and the update of flow variables, while T_{n,GPU}^{com} includes the time of MPI_RECV and MPI_SEND for initialization and boundary exchange, and MPI_REDUCE for the global time step. The histogram of T_{n,GPU}^{com} and T_{n,GPU}^{flow} with different numbers of GPUs is shown in Fig.6 for the 256^{3} and 512^{3} cases. With more GPUs, more time is consumed by communication and the parallel efficiency is reduced, which accounts for the practical scalability in Fig.5. In particular, the communication across GPU nodes with RoCE is needed for the GPU code using more than 8 GPUs, which consumes more time than the communication within a single GPU node with NVLink. The performance of communication across GPU nodes with InfiniBand, which is designed for HPC centers to achieve larger throughput among nodes, will be tested in the future.
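As a concrete example, the measured times in Table 5 for the 256^{3} case give the speedups

S_{2}=\frac{11.697}{6.201}\approx 1.89,\qquad S_{4}=\frac{11.697}{3.223}\approx 3.63,\qquad S_{8}=\frac{11.697}{1.987}\approx 5.89,

i.e. parallel efficiencies S_{n}/n of roughly 94\%, 91\% and 74\%, consistent with the growing share of communication time in Fig.6.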

Figure 5: Taylor-Green vortex: speedup and efficiency of Tesla V100 GPUs with 256^{3} and 512^{3} uniform cells.
Figure 6: Taylor-Green vortex: the histogram of T_{n,GPU}^{com} and T_{n,GPU}^{flow} using different numbers of Tesla V100 GPUs with 256^{3} (left) and 512^{3} (right) uniform cells.
Figure 7: Taylor-Green vortex: top view of Q-criterion with Mach number Ma=0.25, 0.5, 0.75 and 1.0 at t=10 using 256^{3} cells.

The time histories of kinetic energy E_{k}, dissipation rate \varepsilon(E_{k}) and enstrophy dissipation rate \varepsilon(\zeta) of the above nearly incompressible case from the GPU code are identical to the solutions from the previous CPU code, and more details can be found in [23]. With the efficient multiple-GPU accelerated platform, the compressible Taylor-Green vortices with fixed Reynolds number Re=1600 and Mach numbers Ma=0.25, 0.5, 0.75 and 1.0 are also tested. The top view of the Q-criterion at t=10 for Ma=0.25, 0.5, 0.75 and 1.0 using 256^{3} cells is presented in Fig.7. The volume-averaged kinetic energy is given by

E_{k}=\frac{1}{\rho_{0}\Omega}\int_{\Omega}\frac{1}{2}\rho\bm{U}\cdot\bm{U}\text{d}\Omega,

where \Omega is the volume of the computational domain. The total viscous dissipation is defined as

\varepsilon_{com}=\frac{\mu}{\rho_{0}\Omega}\int_{\Omega}\bm{\omega}\cdot\bm{\omega}\text{d}\Omega+\frac{4}{3}\frac{\mu}{\rho_{0}\Omega}\int_{\Omega}(\nabla\cdot\bm{U})^{2}\text{d}\Omega,

where \mu is the coefficient of viscosity and \bm{\omega}=\nabla\times\bm{U}. The time histories of the kinetic energy and total viscous dissipation are given in Fig.8. Compared with the low Mach number cases, the transonic and supersonic cases with Ma=0.75 and Ma=1.0 have regions of increasing kinetic energy rather than a monotone decline. This is consistent with the compressible TGV simulations [38], where increases in the kinetic energy were reported during the time interval 2\leq t\leq 4. A large spread of the total viscous dissipation curves is observed in Fig.8, the main trends being a delay and a flattening of the peak dissipation rate with increasing Mach number. The peak viscous dissipation rate of the Ma=0.25 curve matches closely that of Ma=0.1. Meanwhile, for the higher Mach number cases, the viscous dissipation rate is significantly higher in the late evolution stage (i.e., t\geq 11) than that of the nearly incompressible case.
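For reference, a short Fortran sketch (with assumed array names) of how the volume-averaged kinetic energy E_{k} is accumulated on the uniform mesh is given below; since the cells are uniform, the discrete integral reduces to a mean over the cells.

function kinetic_energy(nx, ny, nz, rho, u, v, w, rho0) result(ek)
  implicit none
  integer, intent(in) :: nx, ny, nz
  real(8), intent(in) :: rho(nx,ny,nz), u(nx,ny,nz), v(nx,ny,nz), w(nx,ny,nz)
  real(8), intent(in) :: rho0
  real(8) :: ek
  integer :: i, j, k
  ek = 0.0d0
  do k = 1, nz
     do j = 1, ny
        do i = 1, nx
           ek = ek + 0.5d0*rho(i,j,k)*(u(i,j,k)**2 + v(i,j,k)**2 + w(i,j,k)**2)
        end do
     end do
  end do
  ! uniform cells: dOmega/Omega = 1/(nx*ny*nz)
  ek = ek/(rho0*dble(nx)*dble(ny)*dble(nz))
end function kinetic_energy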

Figure 8: Taylor-Green vortex: the time history of kinetic energy E_{k} and viscous dissipation \varepsilon_{com} with Mach numbers 0.1, 0.25, 0.5, 0.75 and 1.0 using 256^{3} cells.

4.2 Turbulent channel flows for FP32-FP64 comparison

For most GPUs, the FP32 performance is stronger than the FP64 performance. In this case, to address the effect of FP32 and FP64 precision in terms of efficiency, memory cost and simulation accuracy, both the nearly incompressible and compressible turbulent channel flows are tested. Incompressible and compressible turbulent channel flows have been studied to understand the mechanism of wall-bounded turbulent flows [2, 4, 8]. In the computation, the physical domain is (x,y,z)\in[0,2\pi H]\times[-H,H]\times[0,\pi H], and the computational mesh is given by the coordinate transformation

\begin{cases}
x=\xi,\\
y=\tanh\big(b_{g}(\frac{\eta}{1.5\pi}-1)\big)/\tanh(b_{g}),\\
z=\zeta,
\end{cases}

where the computational domain is (\xi,\eta,\zeta)\in[0,2\pi H]\times[0,3\pi H]\times[0,\pi H] and b_{g}=2. The fluid is initialized with density \rho=1, and the initial streamwise velocity U(y) profile is given by the perturbed Poiseuille flow profile, where white noise with an amplitude of 10% of the local streamwise velocity is added. The spanwise and wall-normal velocities are initialized with white noise. Periodic boundary conditions are used in the streamwise x-direction and spanwise z-direction, and no-slip isothermal boundary conditions are used in the wall-normal y-direction. A constant momentum flux in the streamwise direction is used to determine the external force [23]. In the current study, the nearly incompressible turbulent channel flow with friction Reynolds number Re_{\tau}=395 and Mach number Ma=0.1, and the compressible turbulent channel flow with bulk Reynolds number Re=3000 and bulk Mach number Ma=0.8 are simulated with the same set-up as previous studies [23, 39]. For the compressible case, the viscosity \mu is determined by the power law \mu=\mu_{w}(T/T_{w})^{0.7}, with Prandtl number Pr=0.7 and \gamma=1.4. For the nearly incompressible flow, two cases G_{1} and G_{2} are tested, where 256\times 128\times 128 and 256\times 256\times 128 cells are distributed uniformly in the computational space as in Table 6. For the compressible flow, two cases H_{1} and H_{2} are tested as well, where 128\times 128\times 128 and 128\times 256\times 128 cells are distributed uniformly in the computational space as in Table 7. In the computation, the cases G_{1} and H_{1} are run on a single GPU, while the cases G_{2} and H_{2} are run on two GPUs. Here \Delta y^{+}_{min} and \Delta y^{+}_{max} are the minimum and maximum mesh spacings in the y-direction, and \Delta x^{+} and \Delta z^{+} are the equivalent plus units of the equidistant meshes in the x- and z-directions; a short sketch of the wall-normal mesh stretching is given below. The fine mesh arrangements of cases G_{2} and H_{2} meet the mesh resolution requirement for DNS of turbulent channel flows. The Q-criterion iso-surfaces for G_{2} and H_{2} are given in Fig.9, in which both cases resolve the vortex structures well. With mesh refinement, the high-accuracy scheme resolves abundant turbulent structures.
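The following small Fortran sketch (illustrative, not the authors' code) shows the wall-normal mesh stretching implied by the coordinate transformation above for H=1 and b_{g}=2: a uniform computational coordinate \eta in [0,3\pi H] is mapped to interface positions y in [-H,H] clustered near the two walls.

program stretch_y
  implicit none
  integer, parameter :: ny = 256          ! number of cells in the y-direction (cases G2/H2)
  real(8), parameter :: bg = 2.0d0
  real(8) :: pi, eta, y(0:ny)
  integer :: j
  pi = acos(-1.0d0)
  do j = 0, ny
     eta  = 3.0d0*pi*dble(j)/dble(ny)                      ! uniform computational coordinate
     y(j) = tanh(bg*(eta/(1.5d0*pi) - 1.0d0))/tanh(bg)     ! stretched physical coordinate
  end do
  ! first cell height at the lower wall, to be compared with Delta y^+_min in Tables 6-7
  print *, 'y(0) =', y(0), ' first cell height =', y(1) - y(0)
end program stretch_y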

Case Mesh size \Delta y^{+}_{min}/\Delta y^{+}_{max} \Delta x^{+} \Delta z^{+}
G_{1} 256\times 128\times 128 0.93/12.80 9.69 9.69
G_{2} 256\times 256\times 128 0.45/6.40 9.69 9.69
Table 6: Turbulent channel flow: the details of the mesh for nearly incompressible flow with friction Reynolds number Re_{\tau}=395 and Ma=0.1.
Case Mesh size \Delta y^{+}_{min}/\Delta y^{+}_{max} \Delta x^{+} \Delta z^{+}
H_{1} 128\times 128\times 128 0.43/5.80 8.76 4.38
H_{2} 128\times 256\times 128 0.22/2.90 8.76 4.38
Table 7: Turbulent channel flow: the details of the mesh for compressible flow with bulk Reynolds number Re=3000 and bulk Mach number Ma=0.8.
Figure 9: Turbulent channel flow: the instantaneous Q-criterion iso-surfaces with G_{2} (left) and H_{2} (right).
TITAN RTX T_{2,GPU}^{total} with FP32 T_{2,GPU}^{total} with FP64 S_{fp}
G_{1} 0.414 1.152 2.783
G_{2} 1.635 4.366 2.670
TITAN RTX memory cost with FP32 memory cost with FP64 R_{fp}
G_{1} 2.761 5.355 1.940
G_{2} 5.087 10.009 1.968
Tesla V100 T_{2,GPU}^{total} with FP32 T_{2,GPU}^{total} with FP64 S_{fp}
G_{1} 0.377 0.642 1.703
G_{2} 1.484 2.607 1.757
Tesla V100 memory cost with FP32 memory cost with FP64 R_{fp}
G_{1} 3.064 5.906 1.928
G_{2} 5.390 10.559 1.959
Table 8: Turbulent channel flow: the comparison of computational time (hours) and memory cost (GB) for FP32 and FP64 precision.

Because of the reduction in device memory usage and the improvement of arithmetic throughput on GPUs, benefits can be achieved by using FP32 precision compared with FP64 precision. In view of these strengths, FP32-based and mixed-precision-based high-performance computing has started to be explored [30, 31]. However, whether FP32 precision is sufficient for the DNS of turbulent flows still needs to be examined, so the effect of precision on DNS of compressible turbulence is evaluated here. The GPU-accelerated HGKS is compiled with FP32 and FP64 precision, and both the nearly incompressible and compressible turbulent channel flows are used to test the performance in different precisions; a sketch of how a single working-precision parameter can switch a code between FP32 and FP64 is given after this paragraph. For simplicity, only the memory cost and computational time of the nearly incompressible cases are provided in Table 8, where the execution time T_{2,GPU}^{total} is given in hours and the memory cost in GB. In the computation, two GPUs are used and a statistical period of 10H/U_{c} is simulated. In Table 8, S_{fp} represents the speedup of the FP32 computation over the FP64 computation (the ratio of T_{2,GPU}^{total} with FP64 to that with FP32), and R_{fp} denotes the ratio of the FP64 memory cost to the FP32 memory cost. As expected, the memory cost of FP32 precision is about half of that of FP64 precision for both TITAN RTX and Tesla V100 GPUs. Compared with the FP64-based simulation, a 2.7x speedup is achieved for the TITAN RTX GPU and a 1.7x speedup for the Tesla V100 GPU. As shown in Table 8, the comparable FP32 performance of TITAN RTX and Tesla V100 also leads to comparable T_{2,GPU}^{total} for the two GPUs. We conclude that the Turing and Volta architectures perform comparably for memory-intensive computing tasks in FP32 precision.
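A common way to switch a whole Fortran code between FP32 and FP64 (a sketch under the assumption of a single working-precision parameter; not necessarily how the HGKS code is organized) is shown below: every real variable is declared with the kind wp and every constant carries the _wp suffix, so that one compile-time flag changes the precision of the entire solver.

! compile with -cpp (or name the file .F90); -DSINGLE_PRECISION selects FP32
module precision_m
  implicit none
#ifdef SINGLE_PRECISION
  integer, parameter :: wp = selected_real_kind(6, 37)    ! FP32
#else
  integer, parameter :: wp = selected_real_kind(15, 307)  ! FP64
#endif
end module precision_m

! usage sketch elsewhere in the code:
!   use precision_m
!   real(wp) :: w(5, nx, ny, nz)
!   w = w + 1.0_wp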

Figure 10: Turbulent channel flow: the mean velocity for nearly incompressible turbulent flow. The reference data is given in [4].
Figure 11: Turbulent channel flow: the mean velocity \langle U\rangle^{+} and the velocity with VD transformation \langle U\rangle_{VD}^{+} for compressible turbulent flow. The reference data is given in [9].
Figure 12: Turbulent channel flow: the root-mean-square fluctuation velocity U_{rms}^{+}, V_{rms}^{+}, W_{rms}^{+} and Reynolds stress -\langle U^{\prime}V^{\prime}\rangle profiles for nearly incompressible turbulent flow. The reference data is given in [4].
Figure 13: Turbulent channel flow: the root-mean-square fluctuation velocity U_{rms}^{+}, V_{rms}^{+}, W_{rms}^{+}, Reynolds stress \langle-\rho U^{\prime}V^{\prime}\rangle/\langle\tau_{w}\rangle, root-mean-square of Mach number M_{rms} and turbulent Mach number M_{t} profiles for compressible turbulent flow. The reference data is given in [9].

To validate the accuracy of HGKS and address the effect of different precisions, the statistical turbulent quantities are provided for the nearly incompressible and compressible turbulent channel flows. For the FP32-FP64 comparison, all cases restart from the same flow fields, and 200H/U_{c} statistical periods are used to average the computational results for all cases. For the nearly incompressible channel flow, the logarithmic law is given by

U^{+}=\frac{1}{\kappa}\ln Y^{+}+B, \qquad (10)

where the von Karman constant \kappa=0.40 and B=5.5 are used for the low Reynolds number turbulent channel flow [2]. The mean streamwise velocity profiles in log-linear scale are given in Fig.10, where the HGKS result is in reasonable agreement with the spectral results [4]. In order to account for the mean variation of properties caused by compressibility, the Van Driest (VD) transformation [40] of the density-weighted velocity is considered

\langle U\rangle_{VD}^{+}=\int_{0}^{\langle U\rangle^{+}}\big(\frac{\langle\rho\rangle}{\langle\rho_{w}\rangle}\big)^{1/2}\text{d}\langle U\rangle^{+}, \qquad (11)

where \langle\cdot\rangle represents the average over the statistical periods and the x- and z-directions. For the compressible flow, the transformed velocity is expected to satisfy the incompressible log law Eq.(10). The mean streamwise velocity profile \langle U\rangle^{+} and the VD-transformed profile \langle U\rangle_{VD}^{+} from Eq.(11) are given in Fig.11 in log-linear scale. The HGKS result is in reasonable agreement with the reference DNS solutions under mesh refinement [9]. The mean velocities with FP32 and FP64 precision are also given in Fig.10 and Fig.11. It can be observed that the differences between the two precisions are negligible for statistical turbulent quantities with long-time averaging.
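For completeness, a minimal Fortran sketch (with assumed variable names) of the discrete VD transformation Eq.(11) is given below: the mean profiles U^{+}(j) and \langle\rho\rangle(j), from the wall (j=1, where U^{+}=0) to the channel centre, are integrated with the trapezoidal rule.

subroutine vd_transform(n, uplus, rho, rho_w, uvd)
  implicit none
  integer, intent(in)  :: n
  real(8), intent(in)  :: uplus(n), rho(n), rho_w   ! mean U^+ and density profiles, wall density
  real(8), intent(out) :: uvd(n)                    ! Van Driest transformed velocity
  integer :: j
  uvd(1) = 0.0d0                                    ! U^+ = 0 at the wall
  do j = 2, n
     uvd(j) = uvd(j-1) + 0.5d0*( sqrt(rho(j)/rho_w) + sqrt(rho(j-1)/rho_w) ) &
                        *( uplus(j) - uplus(j-1) )  ! trapezoidal rule for Eq.(11)
  end do
end subroutine vd_transform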

For the nearly incompressible turbulent channel flow, the time-averaged normalized root-mean-square fluctuation velocity profiles U_{rms}^{+}, V_{rms}^{+}, W_{rms}^{+} and the normalized Reynolds stress profile -\langle U^{\prime}V^{\prime}\rangle are given in Fig.12 for the cases G_{1} and G_{2}. The root mean square is defined as \phi_{rms}=\sqrt{\langle(\phi-\langle\phi\rangle)^{2}\rangle} with \phi^{\prime}=\phi-\langle\phi\rangle, where \phi represents a flow variable. With the refinement of the mesh, the HGKS results are in reasonable agreement with the reference data, which are given by the spectral method with 256\times 193\times 192 cells [4]. It should be noted that the spectral method solves the exactly incompressible equations and provides an accurate incompressible solution, while HGKS is more general for both nearly incompressible and compressible flows. For the compressible flow, the time-averaged normalized root-mean-square fluctuation velocity profiles U_{rms}^{+}, V_{rms}^{+}, W_{rms}^{+}, the Reynolds stress \langle-\rho U^{\prime}V^{\prime}\rangle/\langle\tau_{w}\rangle, the root-mean-square of the Mach number M_{rms} and the turbulent Mach number M_{t} are presented in Fig.13 for the cases H_{1} and H_{2}. The turbulent Mach number is defined as M_{t}=q/\langle c\rangle, where q^{2}=\langle(U^{\prime})^{2}+(V^{\prime})^{2}+(W^{\prime})^{2}\rangle and c is the local sound speed. The results of H_{2} agree well with the reference DNS solutions, confirming the accuracy of HGKS for DNS of compressible wall-bounded turbulent flows. The statistical fluctuation quantity profiles with FP32 and FP64 precision are also given in Fig.12 and Fig.13. Despite the small deviation of U_{rms}^{+} between the two precisions, the differences in accuracy between FP32 and FP64 are acceptable for statistical turbulent quantities with such a long time average.

Figure 14: Turbulent channel flow: the instantaneous profiles of U^{+}, V^{+}, W^{+} for G_{2} and H_{2} at t=20H/U_{c} (top) and 200H/U_{c} (bottom) with FP32 and FP64 precision.

The instantaneous profiles of U_{rms}^{+}, V_{rms}^{+}, W_{rms}^{+} at t=20H/U_{c} and 200H/U_{c} are shown in Fig.14 for the cases G_{2} and H_{2} with FP32 and FP64 precision, where each case restarts from the same initial flow field. Compared with the time-averaged profiles, obvious deviations are observed in the instantaneous profiles. The deviations are caused by the round-off error of FP32 precision, which is approximately equal to or even larger than the error of the numerical scheme. Since the detailed effect of the round-off error is not easy to analyze, FP32 precision may not be safe for DNS of turbulence, especially for time-evolving turbulent flows. For problems without very strict accuracy requirements, such as large eddy simulation (LES) and Reynolds-averaged Navier-Stokes (RANS) simulation of turbulence, FP32 precision may be used due to its improvement in efficiency and reduction in memory. It also strongly suggests that the FP64 performance of GPUs still needs to be improved to accommodate the increasing requirements of GPU-based HPC.

5 Conclusion

Based on the multi-scale physical transport and the coupled temporal-spatial gas evolution, HGKS provides a workable tool for the numerical study of compressible turbulent flows. In this paper, to efficiently conduct large-scale DNS of turbulence, the HGKS code is developed for a single GPU using CUDA, and for multiple GPUs using MPI and CUDA. The multiple GPUs are distributed across multiple CPUs at the cost of having to coordinate network communication via MPI. The Taylor-Green vortex problem and turbulent channel flows are presented to validate the performance of HGKS with multiple Nvidia TITAN RTX and Tesla V100 GPUs. We mainly concentrate on the computational efficiency with a single GPU and multiple GPUs, and the comparison between FP32 and FP64 precision. For single-GPU computation, compared with the OpenMP CPU code using the Intel Core i7-9700, a 7x speedup is achieved by TITAN RTX and a 16x speedup by Tesla V100. The computational time of a single Tesla V100 GPU is comparable with that of the MPI code using 300 supercomputer cores with Intel Xeon E5-2692. For multiple GPUs, the HGKS code scales properly with the number of GPUs. It can be inferred that the efficiency of the GPU code with 8 Tesla V100 GPUs approximately equals that of the MPI code with 3000 CPU cores. Compared with the FP64 simulation, the efficiency of HGKS is improved and the memory cost is reduced with FP32 precision. However, with long-time computation of compressible turbulent channel flows, differences in accuracy appear; in particular, the instantaneous statistical turbulent quantities are not acceptable with FP32 precision. The choice of precision should depend on the requirement of accuracy and the available computational resources. In the future, more challenging compressible flow problems, such as the supersonic turbulent boundary layer and the interaction of shock and turbulent boundary layer [41, 42], will be investigated with the efficient multiple-GPU accelerated HGKS.

Acknowledgement

This research is supported by the National Natural Science Foundation of China (11701038) and the Fundamental Research Funds for the Central Universities. We would like to thank Prof. Shucheng Pan of Northwestern Polytechnical University, Dr. Jun Peng and Dr. Shan Tang of Biren Technology for insightful discussions on GPU-based HPC.

References

  • [1] S.B. Pope. Turbulent flows, Cambridge, (2001).
  • [2] J. Kim, P. Moin, R. Moser, Turbulence statistics in fully developed channel flow at low Reynolds number, J. Fluid. Mech. 177 (1987) 133-166.
  • [3] S. Chen, G. Doolen, J.R. Herring, R.H. Kraichnan, et al., Far-dissipation range of turbulence, Physical Review Letters 70 (1993) 3051.
  • [4] R.D. Moser, J. Kim, N.N. Mansour, Direct numerical simulation of turbulent channel flow up to Reτ=590Re_{\tau}=590, Physics of Fluids 11 (1999) 943-945.
  • [5] M. Lee, R. Moser, Direct numerical simulation of turbulent channel flow up to Reτ5200Re_{\tau}\approx 5200, J. Fluid. Mech. 774 (2015) 395-415.
  • [6] J.C. Wang, L.P. Wang, Z.L. Xiao, Y. Shi, S.Y. Chen, A hybrid numerical simulation of isotropic compressible turbulence, J. Comput. Phys. 229 (2010) 5257-5279.
  • [7] G.Y. Cao, L. Pan, K. Xu, Three dimensional high-order gas-kinetic scheme for supersonic isotropic turbulence I: Criterion for direct numerical simulation, Computers and Fluids 192 (2019) 104273.
  • [8] G. Coleman, J. Kim, and R. Moser. A numerical study of turbulent supersonic isothermal-wall channel flow, J. Fluid. Mech. 305 (1995) 159-184.
  • [9] D.X. Fu, Y.W. Ma, X.L. Li, Q. Wang, Direct numerical simulation of compressible turbulence, Science Press (2010).
  • [10] Y. Ming, C.X. Xu, S. Pirozzoli, Genuine compressibility effects in wall-bounded turbulence. Physical Review Fluids 4.12 (2019) 123402.
  • [11] S. Pirozzoli, F Grasso, T.B. Gatski, Direct numerical simulation and analysis of a spatially evolving supersonic turbulent boundary layer at M=2.25M=2.25, Physics of fluids 16 (2004) 530-545.
  • [12] X. Liang, X.L. Li, D.X. Fu, Y.W. Ma, DNS and analysis of a spatially evolving hypersonic turbulent boundary layer over a flat plate at Mach 8, Sci. China Phys. Mech. Astron., 42 (2012) 282-293.
  • [13] P.L. Bhatnagar, E.P. Gross, M. Krook, A Model for Collision Processes in Gases I: Small Amplitude Processes in Charged and Neutral One-Component Systems, Phys. Rev. 94 (1954) 511-525.
  • [14] S. Chapman, T.G. Cowling, The Mathematical theory of Non-Uniform Gases, third edition, Cambridge University Press, (1990).
  • [15] K. Xu, A gas-kinetic BGK scheme for the Navier-Stokes equations and its connection with artificial dissipation and Godunov method, J. Comput. Phys. 171 (2001) 289-335.
  • [16] K. Xu, Direct modeling for computational fluid dynamics: construction and application of unified gas kinetic schemes, World Scientific (2015).
  • [17] J.Q. Li, Z.F. Du, A two-stage fourth order time-accurate discretization for Lax-Wendroff type flow solvers I. hyperbolic conservation laws, SIAM J. Sci. Computing, 38 (2016) 3046-3069.
  • [18] J.Q. Li, Two-stage fourth order: temporal-spatial coupling in computational fluid dynamics (CFD), Advances in Aerodynamics, (2019) 1:3.
  • [19] L. Pan, K. Xu, Q.B. Li, J.Q. Li, An efficient and accurate two-stage fourth-order gas-kinetic scheme for the Navier-Stokes equations, J. Comput. Phys. 326 (2016) 197-221.
  • [20] L. Pan, K. Xu, Two-stage fourth-order gas-kinetic scheme for three-dimensional Euler and Navier-Stokes solutions, Int. J. Comput. Fluid Dyn. 32 (2018) 395-411.
  • [21] Y.Q. Yang, L. Pan, K. Xu, High-order gas-kinetic scheme on three-dimensional unstructured meshes for compressible flows, Physics of Fluids 33 (2021) 096102.
  • [22] M.P.I. Forum, MPI: A Message-Passing Interface Standard, Version 2.2. High Performance Computing Center Stuttgart (2009).
  • [23] G.Y. Cao, L. Pan, K. Xu, High-order gas-kinetic scheme with parallel computation for direct numerical simulation of turbulent flows, J. Comput. Phys. 448 (2022) 110739.
  • [24] C.F. Xu, X.G. Deng, et al., Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer, J. Comput. Phys. 278 (2014) 275-297.
  • [25] J. O’Connora, J.M. Dominguez, B.D. Rogers, S.J. Lind, P.K. Stansby, Eulerian incompressible smoothed particle hydrodynamics on multiple GPUs, Comput. Phys. Commun. 273 (2022) 108263.
  • [26] M. Crialesi-Esposito, et al., FluTAS: A GPU-accelerated finite difference code for multiphase flows, arXiv preprint arXiv:2204.08834 (2022).
  • [27] P. K. Yeung, K. Ravikumar, Advancing understanding of turbulence through extreme-scale computation: Intermittency and simulations at large problem sizes, Physical Review Fluids 5 (2020) 110517.
  • [28] S. Pirozzoli, J. Romero, et al., One-point statistics for turbulent pipe flow up to Re_{\tau}\approx 6000, J. Fluid. Mech. 926 (2021).
  • [29] Y.H. Wang, L. Pan, Three-dimensional discontinuous Galerkin based high-order gas-kinetic scheme and GPU implementation, Computers &\& Fluids 55 (2022) 105510.
  • [30] M. Lehmann, M.J. Krause, et al., On the accuracy and performance of the lattice Boltzmann method with 64-bit, 32-bit and novel 16-bit number formats, arXiv preprint arXiv:2112.08926 (2021).
  • [31] A. Haidar, H. Bayraktar, et al., Mixed-precision iterative refinement using tensor cores on GPUs to accelerate solution of linear systems. Proceedings of the Royal Society A, 476(2243), 20200110.
  • [32] G.S. Jiang, C.W. Shu, Efficient implementation of Weighted ENO schemes, J. Comput. Phys. 126 (1996) 202-228.
  • [33] M. Castro, B. Costa, W. S. Don, High order weighted essentially non-oscillatory WENO-Z schemes for hyperbolic conservation laws, J. Comput. Phys. 230 (2011) 1766-1792.
  • [34] J. Kraus, An Introduction to CUDA-Aware MPI. https://developer.nvidia.com/blog/introduction-cuda-aware-mpi/ (2013).
  • [35] M.E. Brachet, D.I. Meiron, S.A. Orszag, B.G. Nickel, R.H. Morf, U. Frisch, Small-scale structure of the Taylor-Green vortex, J. Fluid. Mech. 130 (1983) 411-452.
  • [36] J. Debonis, Solutions of the Taylor-Green vortex problem using high-resolution explicit finite difference methods, AIAA 2013-0382.
  • [37] J. R. Bull, A. Jameson, Simulation of the compressible Taylor-Green vortex using high-order flux reconstruction schemes, AIAA 2014-3210.
  • [38] N. Peng, Y. Yang, Effects of the Mach Number on the Evolution of Vortex-Surface Fields in Compressible Taylor-Green Flows, Physical Review Fluids, 3 (2018) 013401.
  • [39] G.Y. Cao, L. Pan, K. Xu, S.Y. Chen, High-order gas-kinetic scheme in general curvilinear coordinate for iLES of compressible wall-bounded turbulent flows, arXiv.2107.08609.
  • [40] P.G. Huang, G.N. Coleman, Van Driest transformation and compressible wall-bounded flows, AIAA Journal 32 (1994) 2110-2113.
  • [41] N.A. Adams, Direct simulation of the turbulent boundary layer along a compression ramp at M=3 and Re_{\theta}=1685, J. Fluid. Mech. 420 (2000) 47-83.
  • [42] M.W. Wu, M. Martin, Direct numerical simulation of supersonic turbulent boundary layer over a compression ramp, AIAA journal 45 (2007) 879-889.