CUDA Fortran, graph

I am writing a CFD program with nvfortran. When I use a CUDA graph in CUDA Fortran, the calculation is much faster for the first few steps, but it slows down after about 150 iterations.

For the first 50 steps, cudaGraphLaunch takes about 40 us/step. By around step 2000, cudaGraphLaunch takes about 800-1000 us/step. In other words, cudaGraphLaunch gets more expensive as the iterations go on.

istat = cudaStreamCreateWithFlags(stream, cudaStreamDefault)
print*, istat
istat = cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal)
print*, istat
call caldt<<<grid,tBlock,0,stream>>>(d_x, d_area, d_qp(:,:,:,nT), d_qc(:,:,:,nT), d_dt(:,:,nT), d_im,d_qc0(:,:,:,nT), d_qc(:,:,:,nT))
do mstep = 1,4
call inlet_out<<<1,256,0,stream>>>(d_qc(:,:,:,nT),d_p0,d_t0,d_pb,d_a,d_im)
call res_invs1<<<grid,tblock,0,stream>>>(d_im,d_ilete,d_dx1,d_dx2,d_qc(:,:,:,nT),d_qp(:,:,:,nT),d_flux1(:,:,:),d_flux2(:,:,:))
call update_invs_resj<<<grid,tblock,0,stream>>>(d_im,d_ilete,d_flux1(:,:,:),d_flux2(:,:,:),d_res(:,:,:,nT))
call res_invs3<<<3,256,0,stream>>>(d_im,d_ilete,d_dx2,d_qc(:,:,:,nT),d_qp(:,:,:,nT),d_flux3(:,:),d_res(:,:,:,nT))
call res_artis1<<<grid,tblock,0,stream>>>(d_im,d_ilete,d_dx1,d_dx2,d_area,d_dt(:,:,nT),d_qc(:,:,:,nT),d_qp(:,:,:,nT),d_flux4(:,:,:))
call res_artis2<<<grid,tblock,0,stream>>>(d_im,d_ilete,d_dx1,d_dx2,d_area,d_dt(:,:,nT),d_qc(:,:,:,nT),d_qp(:,:,:,nT),d_flux5(:,:,:))
call update_arti_res2<<<grid,tblock,0,stream>>>(d_im,d_ilete,d_flux4(:,:,:),d_flux5(:,:,:),d_res(:,:,:,nT))
call res_fss<<<3,256,0,stream>>>(d_im,d_ilete,d_dx2,d_qp(:,:,:,nT),d_res(:,:,:,nT))
call update_qc<<<grid,tblock,0,stream>>>(d_ilete,d_im,d_qc(:,:,:,nT),d_qc0(:,:,:,nT),d_qc1(:,:,:,nT),d_area,d_dt(:,:,nT),d_res(:,:,:,nT),d_cfl(1),d_tf(mstep))
call update_primitive<<<grid,tblock,44832,stream>>>(d_im,d_qc(:,:,:,nT),d_qp(:,:,:,nT))
enddo !mstep
istat = cudaStreamEndCapture(stream, graph)
print*, istat
buffer_len = 0
istat = cudaGraphInstantiate(graph_exec, graph, error_node, buffer, buffer_len)
print*, istat

call CPU_TIME(startt)

do iter = 1,200
istat = cudaGraphLaunch(graph_exec, stream)
call CPU_TIME(enddt)
write(*,*) (enddt-startt), iter !,err
startt=enddt
enddo!iter

This is my code; does anyone have any ideas? Thanks in advance!

Hi FRANKmartin,

I don’t use CUDA Graphs myself since my kernels tend to be long-running, so the launch overhead rarely matters. Hence I asked some other folks, but they’re not sure either.

Do you have a reproducible example you can share so we can investigate?

Do the kernels simply run longer in the later iterations, or is the execution time of the kernels consistent?

-Mat


Hello MatColgrove,

Thanks for your reply.
I am using another laptop, so I can only use this account to reply.

Here is an example; when multi_time is set to 1000, you can see the problem.

module mytests
use cudafor
integer,parameter::xdim=1024*1024

contains
attributes(global) subroutine test1(a)
	implicit none
	integer :: idx,num,i
	integer :: a(xdim)
	idx = threadIdx%x
	num = blockDim%x
	do i=idx,xdim,num
		a(i) = i
	enddo
end subroutine test1

end module mytests

program main
use mytests
use cudafor
implicit none

integer,device,allocatable :: a_gpu(:)
type(dim3) :: tBlock
integer :: i, n, istat, multi_time=1000, t
integer(kind=cuda_stream_kind)::stream
type(cudaGraph) :: graph
type(cudaGraphExec) :: graph_exec
type(cudaGraphNode) :: error_node
character(len=1) :: buffer
integer:: buffer_len
integer::zero=0
integer::flag
real(kind=4)::s1,s2

allocate(a_gpu(xdim))

tblock=dim3(32,1,1)

istat = cudaStreamCreateWithFlags(stream, cudaStreamDefault)
print*, istat

istat = cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal)
print*, istat
do i=1,10
	call test1<<<1,tblock,0,stream>>>(a_gpu)
enddo
istat = cudaStreamEndCapture(stream, graph)
print*, istat
buffer_len = 0
istat = cudaGraphInstantiate(graph_exec, graph, error_node, buffer, buffer_len)
print*, istat

write(*,*) 'start with graph'
a_gpu=-1

call cpu_time(s1)
do t=1,multi_time
	
	istat = cudaGraphLaunch(graph_exec, stream)

	if(mod(t,100)==0)then
		call cpu_time(s2)
		write(*,*) t*10,(s2-s1)
		s1=s2
	endif

enddo


write(*,*) 'start with default call'

call cpu_time(s1)

do t=1,multi_time*10
	call test1<<<1,tblock,0,stream>>>(a_gpu)
	if(mod(t,1000)==0)then
		call cpu_time(s2)
		write(*,*) t,(s2-s1)
		s1=s2
	endif
enddo

end program

I am using NVIDIA HPC SDK 21.9; the GPUs I use are a GTX2080 and an A100.


When multi_time in the code above is set to 1000, I get the following results.

        0
        0
        0
        0

start with graph
1000 4.8494339E-04
2000 4.8089027E-04
3000 4.5800209E-04
4000 4.5108795E-04
5000 4.5394897E-04
6000 0.2611630
7000 0.2590711
8000 0.2590709
9000 0.2590702
10000 0.2590699
start with default call
1000 1.295333
2000 0.2825339
3000 0.2597539
4000 0.2597561
5000 0.2597580
6000 0.2597580
7000 0.2597563
8000 0.2597599
9000 0.2597561
10000 0.2597570

Are you seeing this on both the GTX2080 and the A100?

I tried on P100, V100, and A100. On the P100 and V100 I’m able to reproduce the slow-downs, though there is no slow-down on the A100.

In case it was a CUDA driver issue, I tried multiple V100s with various CUDA driver versions, but I still see the slow-down.

Hence my best guess is that it’s a hardware issue which has been addressed in the A100s.

Here’s my output from my A100 run:

% nvidia-smi
Wed Jul  6 11:18:38 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:4B:00.0 Off |                    0 |
| N/A   34C    P0    46W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:98:00.0 Off |                    0 |
| N/A   35C    P0    43W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
% nvfortran test.CUF -fast -V21.9 ; a.out
            0
            0
            0
            0
 start with graph
         1000   1.9502640E-04
         2000   1.9216537E-04
         3000   1.8000603E-04
         4000   1.8095970E-04
         5000   1.8095970E-04
         6000   1.8191338E-04
         7000   1.8000603E-04
         8000   1.8215179E-04
         9000   1.8000603E-04
        10000   1.8191338E-04
 start with default call
         1000    3.420586
         2000   0.4200239
         3000   0.3507664
         4000   0.3507595
         5000   0.3507738
         6000   0.3507733
         7000   0.3507771
         8000   0.3507690
         9000   0.3507795
        10000   0.3507700

-Mat

Your timing is incorrect.
You can’t use CPU time to report kernel time (unless you put a cudaDeviceSynchronize after the kernel launch).
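For example, as a rough sketch reusing the variable names from the reproducer above, the graph-launch timing loop could be changed to:

call cpu_time(s1)
do t = 1, multi_time
	istat = cudaGraphLaunch(graph_exec, stream)
	if (mod(t,100) == 0) then
		! wait for all queued kernels to finish before reading the clock,
		! so the interval reflects execution time, not just launch overhead
		istat = cudaDeviceSynchronize()
		call cpu_time(s2)
		write(*,*) t*10, (s2-s1)
		s1 = s2
	endif
enddo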

You could use nsys to generate a timeline and then look at the individual kernels.

I can only send three replies, so I created a new account.

Thanks for your reply. Are there any ways to use CUDA Graphs on a V100 or GTX2080 without slow-downs? Or could you please set a larger multi_time on the A100, something like 10000 or even larger, to see whether there are still slow-downs? I tried on an A100 40GB and see the same problem. Thanks in advance!

Yes, you are right, thanks for pointing that out. But I don’t think that is the problem: when I use nvprof to look at the time cost of each call, the time spent in cudaGraphLaunch itself grows as the iterations go on.

If you use Nsight Systems (nsys) you will get a better picture.
Just run:
nsys profile ./a.out
and then import the report.nsys-rep into the Nsight Systems GUI.
If you control-click on CUDA API and pick “Show in Event View”, you can clearly see all the calls.
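If the GUI is not handy, a command-line summary should show the same information (assuming the report was written as report.nsys-rep; the CUDA API summary table includes the total and average time spent in cudaGraphLaunch):

nsys profile -o report ./a.out
nsys stats report.nsys-rep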

You will notice that cudaGraphLaunch is pretty constant at ~12 us for several hundred iterations (~500) and then jumps to ~4 ms (probably some internal work queue is being filled).
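If you want to test the work-queue theory yourself, a sketch (untested, reusing the reproducer’s variables) is to drain the stream periodically and check whether the per-launch cost then stays flat:

do t = 1, multi_time
	istat = cudaGraphLaunch(graph_exec, stream)
	! block every 64 launches so the pending work queue never grows unbounded;
	! if the launch time then stays roughly constant, the slow-down is
	! back-pressure from queued work rather than the graph launch itself
	if (mod(t,64) == 0) istat = cudaStreamSynchronize(stream)
enddo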

For most CFD applications (and I have worked on many), launch overhead is not an issue, so I would not worry too much. There are usually bigger fish to fry, and the effect of the kernel launch will end up in the noise.

Cool! Thanks a lot!
As I am using a relatively small grid of about 0.2 million cells, the launch overhead is relatively large. Thanks for your information!