Hi everyone.
I’m getting poor performance using CUDA Fortran to accelerate a CFD code based on an explicit finite difference method.
We don’t need to worry about verification and code correctness at this moment.
The problem is a 2D flow and the method uses time integration. One way to accelerate this code is to parallelize the spatial derivative approximation. For example, a mesh with 20 points in each direction (400 in total) can be split into more than 400 simultaneous operations. I’ll skip some method details to keep this post from getting too long.
The time integration itself is sequential, which can be a problem because this method needs three spatial derivative approximations per time step. (For those interested in alternatives, Parareal is an option for parallelizing time-dependent problems.) Before diving into more coding, debugging, and parallelization, I decided to check the performance and ask for some opinions.
So, let’s go! My mesh doesn’t have the same number of points in each direction, and I decided to launch blocks with 1D threads (x-direction only). For small meshes, the thread count is below 1024, so there is only one block for each direction of the problem. I launch one stream per problem direction, so the two kernels can execute concurrently if there are enough resources.
For bigger meshes, I divide the work into blocks and threads. The thread count is at most 1024, and each block gets almost the same number of threads. I could also launch the first blocks with 1024 threads and the last one with the remaining threads (this could hurt performance, because a large number of threads could be just sitting and watching the others work). I don’t know whether it’s possible to launch different thread counts for different blocks of one kernel (my guess is it can’t be done).
Next, I present slices of the code.
Subroutine Time_integration
   !Variables declaration
   !Preparation for the time integration
   Do while (t < tf)
      !Method specific operations
      !Time step computation
      !Boundary conditions application (on the host)
      !CUDA preparation and memory transfer Host to Device
      Call dev_Flux_x<<<Blocks_x,Threads_x,0,stream1>>>(variables)
      Call dev_Flux_y<<<Blocks_y,Threads_y,0,stream2>>>(variables)
      istat = cudaDeviceSynchronize()
      !Memory transfer Device to Host
      !Method specific operations
      !Update the variables
      !Boundary conditions application (on the host)
      !CUDA preparation and memory transfer Host to Device
      Call dev_Flux_x<<<Blocks_x,Threads_x,0,stream1>>>(variables)
      Call dev_Flux_y<<<Blocks_y,Threads_y,0,stream2>>>(variables)
      istat = cudaDeviceSynchronize()
      !Memory transfer Device to Host
      !Method specific operations
      !Update the variables
      !Boundary conditions application (on the host)
      !CUDA preparation and memory transfer Host to Device
      Call dev_Flux_x<<<Blocks_x,Threads_x,0,stream1>>>(variables)
      Call dev_Flux_y<<<Blocks_y,Threads_y,0,stream2>>>(variables)
      istat = cudaDeviceSynchronize()
      !Memory transfer Device to Host
      !Method specific operations
      !Update the variables
      !Gather some results and do intermediary analysis
      t = t + dt
   End do
End subroutine Time_integration
Attributes(global) subroutine dev_Flux_x(variables)
   !Variables declaration
   b = blockIdx%x
   t = threadIdx%x
   If (t < mt(b)) then
      !Memory allocation
      !Retrieve the i mesh position for thread t
      !Retrieve the j mesh position for thread t
      !Method specific operations
      !Dynamic parallelism
      Call dev_WENO<<<1,4>>>(variables)
      istat = cudaDeviceSynchronize()
      call syncthreads()
      !Method specific operations
   End if
End subroutine dev_Flux_x
dev_Flux_y is similar, and dev_WENO doesn’t add anything conceptually different. mt(b) is the maximum thread number for block b, used to avoid out-of-bounds memory accesses and unnecessary computation. I ran four test cases:
- 660 points and 1 block with 350 threads in each direction
- 2644 points and 2 blocks with 681 threads in each direction
- 10588 points and 6 blocks with 896 threads in each direction
- 42352 points and 21 blocks with 1016 threads in each direction
For test case 1, the memory transfer Host to Device takes approximately 1 µs and Device to Host approximately 3 µs. The dev_Flux_x and dev_Flux_y kernel durations are approx. 26 ms, and dev_WENO approx. 10 µs.
For test case 2, the memory transfer durations are practically the same. The dev_Flux_x and dev_Flux_y kernel durations are approx. 70 ms, and the dev_WENO duration is practically the same.
For test case 3, the Host to Device transfer is practically the same as in test case 1, and Device to Host increased to approx. 30 µs. The dev_Flux_x and dev_Flux_y kernel durations range from 200 ms to 400 ms, and the dev_WENO duration is practically the same as in test case 1.
For test case 4, the Host to Device transfer is practically the same as in test case 1, and Device to Host increased to approx. 120 µs. The dev_Flux_x and dev_Flux_y kernel durations range from 400 ms to 1.7 s, and the dev_WENO duration is practically the same as in test case 1.
If we take the worst case for memory transfer (120 µs), with three Device to Host transfers per time step and even 1 million time steps, the total transfer time would be roughly 360 s, which is negligible for a big-mesh simulation. I think we can discard the memory transfer overhead.
The dev_WENO kernel duration is practically the same for each case and it’s very short.
If we look at the dev_Flux_x (or dev_Flux_y) durations, I think we can also rule out kernel launch overhead, because the duration increases substantially from one mesh to the next. If launch overhead were the problem, I would expect practically the same duration for all test cases.
For test case 4, the NVIDIA Nsight Compute GPU Speed Of Light report gives 15 % for SM (streaming multiprocessor) utilization and 27 % for memory.
The question is: what am I missing?
Obs. 1: the number of registers is limited to 64 for each thread.
Obs. 2: cudaLimitMallocHeapSize was increased beyond its default (needed for the in-kernel memory allocation).
Obs. 3: double precision computation.
Obs. 4: Obs. 1 and 2 can save a lot of time for those starting to work with CUDA.
Compiler: PGI Fortran.
OS: Debian 10.
GPU: GeForce GTX 1060 6 GB.
Best regards,
Nicholas.