I am writing a CFD program using nvfortran. When I use CUDA Graphs in CUDA Fortran, the calculation is much faster for the first few steps, but it slows down after about 150 iterations.
For the first 50 steps, cudaGraphLaunch costs about 40 us/step. By about 2000 steps, cudaGraphLaunch costs about 800-1000 us/step. That means cudaGraphLaunch gets more expensive as the iterations go on.
istat = cudaStreamCreateWithFlags(stream, cudaStreamDefault)
print*, istat
istat = cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal)
print*, istat
call caldt<<<grid,tBlock,0,stream>>>(d_x, d_area, d_qp(:,:,:,nT), d_qc(:,:,:,nT), d_dt(:,:,nT), d_im, d_qc0(:,:,:,nT), d_qc(:,:,:,nT))
do mstep = 1, 4
  call inlet_out<<<1,256,0,stream>>>(d_qc(:,:,:,nT),d_p0,d_t0,d_pb,d_a,d_im)
  call res_invs1<<<grid,tBlock,0,stream>>>(d_im,d_ilete,d_dx1,d_dx2,d_qc(:,:,:,nT),d_qp(:,:,:,nT),d_flux1(:,:,:),d_flux2(:,:,:))
  call update_invs_resj<<<grid,tBlock,0,stream>>>(d_im,d_ilete,d_flux1(:,:,:),d_flux2(:,:,:),d_res(:,:,:,nT))
  call res_invs3<<<3,256,0,stream>>>(d_im,d_ilete,d_dx2,d_qc(:,:,:,nT),d_qp(:,:,:,nT),d_flux3(:,:),d_res(:,:,:,nT))
  call res_artis1<<<grid,tBlock,0,stream>>>(d_im,d_ilete,d_dx1,d_dx2,d_area,d_dt(:,:,nT),d_qc(:,:,:,nT),d_qp(:,:,:,nT),d_flux4(:,:,:))
  call res_artis2<<<grid,tBlock,0,stream>>>(d_im,d_ilete,d_dx1,d_dx2,d_area,d_dt(:,:,nT),d_qc(:,:,:,nT),d_qp(:,:,:,nT),d_flux5(:,:,:))
  call update_arti_res2<<<grid,tBlock,0,stream>>>(d_im,d_ilete,d_flux4(:,:,:),d_flux5(:,:,:),d_res(:,:,:,nT))
  call res_fss<<<3,256,0,stream>>>(d_im,d_ilete,d_dx2,d_qp(:,:,:,nT),d_res(:,:,:,nT))
  call update_qc<<<grid,tBlock,0,stream>>>(d_ilete,d_im,d_qc(:,:,:,nT),d_qc0(:,:,:,nT),d_qc1(:,:,:,nT),d_area,d_dt(:,:,nT),d_res(:,:,:,nT),d_cfl(1),d_tf(mstep))
  call update_primitive<<<grid,tBlock,44832,stream>>>(d_im,d_qc(:,:,:,nT),d_qp(:,:,:,nT))
enddo !mstep
istat = cudaStreamEndCapture(stream, graph)
print*, istat
buffer_len = 0
istat = cudaGraphInstantiate(graph_exec, graph, error_node, buffer, buffer_len)
print*, istat
call CPU_TIME(startt)
do iter = 1, 200
  istat = cudaGraphLaunch(graph_exec, stream)
  call CPU_TIME(enddt)
  write(*,*) (enddt-startt), iter !,err
  startt = enddt
enddo !iter
This is my code. Does anyone have any ideas? Thanks in advance!
Hi FRANKmartin,
I don’t use CUDA Graphs myself since my kernels tend to be longer running so the launch overhead rarely matters. Hence I asked some other folks but they’re not sure either.
Do you have a reproducible example you can share so we can investigate?
Do the kernels simply run longer for the later iterations or should the execution time of the kernels be consistent?
-Mat
Hello MatColgrove,
Thanks for your reply.
I am using another laptop, so I can only use this account to reply.
Here is an example; when multi_time is set to 1000, you can see the problem.
module mytests
  use cudafor
  integer, parameter :: xdim = 1024*1024
contains
  attributes(global) subroutine test1(a)
    implicit none
    integer :: idx, num, i
    integer :: a(xdim)
    idx = threadIdx%x
    num = blockDim%x
    do i = idx, xdim, num
      a(i) = i
    enddo
  end subroutine test1
end module mytests

program main
  use mytests
  use cudafor
  implicit none
  integer, device, allocatable :: a_gpu(:)
  type(dim3) :: tBlock
  integer :: i, n, istat, multi_time=1000, t
  integer(kind=cuda_stream_kind) :: stream
  type(cudaGraph) :: graph
  type(cudaGraphExec) :: graph_exec
  type(cudaGraphNode) :: error_node
  character(len=1) :: buffer
  integer :: buffer_len
  integer :: zero = 0
  integer :: flag
  real(kind=4) :: s1, s2

  allocate(a_gpu(xdim))
  tBlock = dim3(32,1,1)
  istat = cudaStreamCreateWithFlags(stream, cudaStreamDefault)
  print*, istat
  istat = cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal)
  print*, istat
  do i = 1, 10
    call test1<<<1,tBlock,0,stream>>>(a_gpu)
  enddo
  istat = cudaStreamEndCapture(stream, graph)
  print*, istat
  buffer_len = 0
  istat = cudaGraphInstantiate(graph_exec, graph, error_node, buffer, buffer_len)
  print*, istat

  write(*,*) 'start with graph'
  a_gpu = -1
  call cpu_time(s1)
  do t = 1, multi_time
    istat = cudaGraphLaunch(graph_exec, stream)
    if (mod(t,100) == 0) then
      call cpu_time(s2)
      write(*,*) t*10, (s2-s1)
      s1 = s2
    endif
  enddo

  write(*,*) 'start with default call'
  call cpu_time(s1)
  do t = 1, multi_time*10
    call test1<<<1,tBlock,0,stream>>>(a_gpu)
    if (mod(t,1000) == 0) then
      call cpu_time(s2)
      write(*,*) t, (s2-s1)
      s1 = s2
    endif
  enddo
end program
I am using NVIDIA HPC SDK 21.9; the GPUs that I use are a GTX2080 and an A100.
When multi_time in the code above is set to 1000, I get the following results.
0
0
0
0
start with graph
1000 4.8494339E-04
2000 4.8089027E-04
3000 4.5800209E-04
4000 4.5108795E-04
5000 4.5394897E-04
6000 0.2611630
7000 0.2590711
8000 0.2590709
9000 0.2590702
10000 0.2590699
start with default call
1000 1.295333
2000 0.2825339
3000 0.2597539
4000 0.2597561
5000 0.2597580
6000 0.2597580
7000 0.2597563
8000 0.2597599
9000 0.2597561
10000 0.2597570
Are you seeing this on both the GTX2080 and the A100?
I tried on P100, V100, and A100. On the P100 and V100 I'm able to reproduce the slow-down; however, there is no slow-down on the A100.
In case it was a CUDA driver issue, I tried multiple V100s with various CUDA drivers, but I still see the slow-down.
Hence my best guess is that it's a hardware limitation which has been addressed on the A100.
Here’s my output from my A100 run:
% nvidia-smi
Wed Jul 6 11:18:38 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... On | 00000000:4B:00.0 Off | 0 |
| N/A 34C P0 46W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... On | 00000000:98:00.0 Off | 0 |
| N/A 35C P0 43W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
% nvfortran test.CUF -fast -V21.9 ; a.out
0
0
0
0
start with graph
1000 1.9502640E-04
2000 1.9216537E-04
3000 1.8000603E-04
4000 1.8095970E-04
5000 1.8095970E-04
6000 1.8191338E-04
7000 1.8000603E-04
8000 1.8215179E-04
9000 1.8000603E-04
10000 1.8191338E-04
start with default call
1000 3.420586
2000 0.4200239
3000 0.3507664
4000 0.3507595
5000 0.3507738
6000 0.3507733
7000 0.3507771
8000 0.3507690
9000 0.3507795
10000 0.3507700
-Mat
Your timing is incorrect.
You can’t use CPU time to report kernel time ( unless you put a cudaDeviceSynchronize after the kernel launch).
You could use nsys to generate a timeline and then look at the individual kernels.
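For a quick sanity check without a profiler, one option is to time with CUDA events, which measure GPU time rather than host time. This is only a sketch, reusing the test1 kernel, tBlock, a_gpu, and stream names from the reproducer above:

```fortran
! Sketch only: time the kernel with CUDA events instead of cpu_time.
! Assumes test1, tBlock, a_gpu, and stream from the reproducer above.
type(cudaEvent) :: evStart, evStop
real(kind=4) :: ms
integer :: istat

istat = cudaEventCreate(evStart)
istat = cudaEventCreate(evStop)

istat = cudaEventRecord(evStart, stream)
call test1<<<1,tBlock,0,stream>>>(a_gpu)
istat = cudaEventRecord(evStop, stream)

! Block the host until the stop event completes, then read the GPU time.
istat = cudaEventSynchronize(evStop)
istat = cudaEventElapsedTime(ms, evStart, evStop)
write(*,*) 'kernel time (ms):', ms
```

Without a synchronize, cpu_time only measures how long the asynchronous launch call takes on the host, which is why the numbers change once the launch queue fills up.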
I can only send three replies, so I created a new account.
Thanks for your reply. Is there any way to use CUDA Graphs on the V100 or GTX2080 without slow-downs? Also, could you please set a larger multi_time on the A100, something like 10000 or even larger, to see whether the slow-down still appears? I tried on an A100 40GB and see the same problem. Thanks in advance!
Yes, you are right, thanks for pointing that out. But I don't think that's the problem: when I use nvprof to look at the time cost of each call, the time spent in cudaGraphLaunch itself grows as the iterations go on.
If you use Nsight Systems (nsys) you will get a better picture.
Just run:
nsys profile ./a.out
and then open the generated report.nsys-rep in Nsight Systems.
If you ctrl-click on CUDA API and pick "Show in Event View" you can clearly see all the calls.
You will notice that cudaGraphLaunch is pretty constant at ~12 us for several hundred iterations (~500) and then jumps to ~4 ms (probably some internal work queue is being filled).
For most CFD applications (and I have worked on many), launch overhead is not an issue, so I would not worry too much. There are usually bigger fish to fry, and the effect of the kernel launch will end up in the noise.
Cool! Thanks a lot!
Since I am using a relatively small grid of about 0.2 million cells, the launch overhead is relatively significant for me. Thanks for your information!