CUDA Fortran performance with explicit finite difference method

Hi everyone.

I’m experiencing poor performance when using CUDA Fortran to accelerate a CFD code based on an explicit finite difference method.

We don’t need to worry about verification and code correctness at this moment.

The problem is a 2D flow and the method uses explicit time integration. One way to accelerate this code is to exploit parallelism in the spatial derivative approximations. For example, on a mesh with 20 points in each direction (400 points in total), the work can be split into more than 400 simultaneous operations. There are some method details I will skip to avoid a very long text.

Relying only on spatial parallelism can be a problem, because the method needs three spatial derivative approximations per time step while the time loop itself is sequential. For those interested in alternatives, Parareal can be an option for time-dependent problems. Before diving into more coding, debugging, and parallelization, I decided to check the performance and ask for some opinions.

So, let’s go! My mesh doesn’t have the same number of points in each direction, and I decided to launch blocks with a 1D thread layout (threads only along the block’s x-dimension). For small meshes, the thread count is below 1024, so there is only one block per problem direction. I launch one stream per problem direction, so the two kernels can execute concurrently if there are enough resources.

For bigger meshes, I divide the work into blocks and threads. The thread count per block is less than or equal to 1024, and each block gets almost the same number of threads. I could instead launch the first blocks with 1024 threads and the last one with the remaining threads, but that could hurt performance because a large number of threads would just sit and watch the others work. I don’t know whether it’s possible to launch different thread counts for different blocks in one kernel (my guess is it can’t be done).
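Just to make the setup concrete before the actual code slices below, here is a rough sketch of how the block/thread counts and the streams are prepared on the host (variable names are only illustrative, not my actual code):

! Illustrative host-side setup (names are placeholders)
use cudafor
integer :: istat
integer :: Np_x, Np_y                              ! points to process in each direction
integer :: Blocks_x, Threads_x, Blocks_y, Threads_y
integer(kind=cuda_stream_kind) :: stream1, stream2

! One thread per point, capped at 1024 threads per block,
! with the work balanced so every block gets almost the same thread count
Blocks_x  = (Np_x + 1023) / 1024
Threads_x = (Np_x + Blocks_x - 1) / Blocks_x
Blocks_y  = (Np_y + 1023) / 1024
Threads_y = (Np_y + Blocks_y - 1) / Blocks_y

! One stream per problem direction so dev_Flux_x and dev_Flux_y can overlap
istat = cudaStreamCreate(stream1)
istat = cudaStreamCreate(stream2)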

Next, I present slices of the code.

Subroutine Time_integration

!Variables declaration

!Preparation to the time integration

Do while (t<tf)

  !Method specific operations
  !Time step computation

  !Boundary conditions application(on the host)
  !CUDA preparation and memory transfer Host to Device

  Call dev_Flux_x<<<Blocks_x,Threads_x,0,stream1>>>(variables)
  Call dev_Flux_y<<<Blocks_y,Threads_y,0,stream2>>>(variables)
  istat=cudaDeviceSynchronize()

  !Memory transfer Device to Host

  !Method specific operations
  !Update the variables

  !Boundary conditions application(on the host)
  !CUDA preparation and memory transfer Host to Device

  Call dev_Flux_x<<<Blocks_x,Threads_x,0,stream1>>>(variables)
  Call dev_Flux_y<<<Blocks_y,Threads_y,0,stream2>>>(variables)
  istat=cudaDeviceSynchronize()

  !Memory transfer Device to Host

  !Method specific operations
  !Update the variables

  !Boundary conditions application(on the host)
  !CUDA preparation and memory transfer Host to Device

  Call dev_Flux_x<<<Blocks_x,Threads_x,0,stream1>>>(variables)
  Call dev_Flux_y<<<Blocks_y,Threads_y,0,stream2>>>(variables)
  istat=cudaDeviceSynchronize()

  !Memory transfer Device to Host

  !Method specific operations
  !Update the variables

  !Gather some results and do intermediary analysis

  t=t+dt

End do

End subroutine Time_integration

Attributes(global) subroutine dev_Flux_x(variables)

!Variables declaration

b=BlockIdx%x
t=threadIdx%x

If (t<mt(b)) then

  !Memory allocation

  !Retrieve the i mesh position for thread t
  !Retrieve the j mesh position for thread t

  !Method specific operations

  !Dynamic parallelism
  Call dev_WENO<<<1,4>>>(variables)
  istat=cudaDeviceSynchronize()
  call syncthreads()

  !Method specific operations

End if

End subroutine dev_Flux_x

The dev_Flux_y kernel is similar, and dev_WENO doesn’t add anything conceptually new. mt(b) is the maximum number of active threads for block b, used to avoid erroneous memory accesses and unnecessary computation. I ran four test cases:

  1. 660 points and 1 block with 350 threads in each direction
  2. 2644 points and 2 blocks with 681 threads in each direction
  3. 10588 points and 6 blocks with 896 threads in each direction
  4. 42352 points and 21 blocks with 1016 threads in each direction

For test case 1, the Host-to-Device memory transfer takes approximately 1 µs and the Device-to-Host transfer approximately 3 µs. The dev_Flux_x and dev_Flux_y kernels each take approx. 26 ms, and dev_WENO approx. 10 µs.

For test case 2, the memory transfer durations are practically the same. The dev_Flux_x and dev_Flux_y kernels take approx. 70 ms, and the dev_WENO duration is practically unchanged.

For test case 3, the Host-to-Device transfer is practically the same as in test case 1, while the Device-to-Host transfer increases to approx. 30 µs. The dev_Flux_x and dev_Flux_y kernel durations range from 200 ms to 400 ms, and the dev_WENO duration is practically the same as in test case 1.

For test case 4, the Host-to-Device transfer is practically the same as in test case 1, while the Device-to-Host transfer increases to approx. 120 µs. The dev_Flux_x and dev_Flux_y kernel durations range from 400 ms to 1.7 s, and the dev_WENO duration is practically the same as in test case 1.

If we consider the worst case for memory transfer (120 µs per transfer, three transfers per time step) and even 1 million time steps, the total transfer time would be roughly 3 × 120 µs × 10^6 ≈ 360 s, which is negligible for a big-mesh simulation. I think we can discard the memory-transfer overhead.

The dev_WENO kernel duration is practically the same in every case, and it is very short.

If we look at the dev_Flux_x (or dev_Flux_y) duration, I think we can also discard kernel launch overhead, because the duration grows substantially from one mesh to the next. If launch overhead were the problem, I would expect practically the same duration for all test cases.

For test case 4, the NVIDIA Nsight Compute GPU Speed Of Light report gives 15% for SM (streaming multiprocessor) and 27% for memory.

The question is: what am I missing?

Obs. 1: the number of registers is limited to 64 per thread.
Obs. 2: cudaLimitMallocHeapSize was set to a larger value.
Obs. 3: all computation is in double precision.
Obs. 4: obs. 1 and 2 can save a lot of time for those starting to work with CUDA.
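For completeness, this is roughly how obs. 1 and 2 were applied (the flag and the call are what I used, as far as I remember; check the docs for your compiler/CUDA version, and the heap value is just an example):

! Obs. 1: register limit set at compile time, e.g.
!   pgfortran -Mcuda=maxregcount:64 ...

! Obs. 2: enlarge the device heap before any kernel that allocates memory
use cudafor
integer :: istat
integer(kind=cuda_count_kind) :: heapsize

heapsize = 512 * 1024 * 1024          ! example value (512 MB), tune as needed
istat = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize)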

Compiler: PGI Fortran.
OS: Debian 10.
GPU: GeForce GTX 1060 6 GB.

Best regards,
Nicholas.

I assume you are aware that consumer GPUs have poor throughput for double-precision operations? Around 120 DP GFLOPS for a GeForce GTX 1060. How many DP GFLOPS does your code achieve?

Generally speaking, it is best to start with a block size of between 128 and 256 threads. This gives fine granularity of resource usage, resulting in better overall utilization of GPU resources. The thread count per block should be a multiple of 32. For good performance you would want to run on the order of 10K threads in total.
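As an illustration only (adapt it to your data layout), a flat 1D mapping with a fixed block size would look something like this:

! Illustrative launch configuration: one thread per mesh point
integer, parameter :: TPB = 256                 ! multiple of 32, in the 128-256 range
integer :: nPoints, nBlocks

nBlocks = (nPoints + TPB - 1) / TPB             ! enough blocks to cover every point
call dev_Flux_x<<<nBlocks, TPB, 0, stream1>>>(variables)

! and inside the kernel (indices are 1-based in CUDA Fortran):
!   idx = (blockIdx%x - 1) * blockDim%x + threadIdx%x
!   if (idx <= nPoints) then ... end if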

I am, but I was expecting better performance, even from a gaming GPU. How can I check the DP GFLOPS?

What do you mean? Is this maximum per kernel/stream?

Check the GFLOPS: determine the number of DP floating-point operations per kernel launch and divide by the kernel execution time. The CUDA profiler might have that metric readily available somewhere.

10K threads: total across the launch grid. It is not a maximum but a recommended minimum. Modern GPUs typically have between 1,000 and 6,000 “cores”; keeping that many cores gainfully occupied typically requires around 10K threads at a minimum.
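As a back-of-the-envelope example (the numbers below are placeholders, not measurements from your runs):

! Rough DP GFLOPS estimate
real(8) :: flop_count, kernel_time_s, gflops

flop_count    = 1.0d8        ! DP operations in one kernel launch (count them or use the profiler)
kernel_time_s = 0.4d0        ! measured kernel duration in seconds
gflops = flop_count / kernel_time_s / 1.0d9
! 1e8 ops in 0.4 s -> 0.25 GFLOPS, to be compared against the ~120 GFLOPS peak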

I didn’t find the GFLOPS information. I read the documentation and ran the same commands, but the profiler didn’t show the GFLOPS. Can the number of DP floating-point operations be retrieved through the following metric?

inst_fp_64: Number of double-precision floating-point instructions executed by non-predicated threads (arithmetic, compare, etc.)
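
If that is the right metric, I would collect it with something like the command below (my_cfd_app is a placeholder for the executable; I have also seen a flop_count_dp metric mentioned in the nvprof docs, which may be closer to an operation count than an instruction count):

nvprof --metrics inst_fp_64,flop_count_dp ./my_cfd_app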