Hi,
I am trying to better understand the time overhead required to launch kernels.
I have written the following test code:
program main
#ifdef _MCUDA_
  USE cudafor
#endif
  implicit none
  integer*4 :: N, nc, i, k, itime, nt, nargs, ierr, iwait
  real*8, allocatable :: a(:), b(:)
  character*16 arg
  integer*4 :: dt1(8), dt2(8), it1, it2, istat, &
               idummy, icountrate, icountmax
  real*8 :: rt1, rt2

  N = 1E4     ! number of parallel threads
  nt = 1000   ! number of timing iterations
  nc = 10     ! number of compute iterations
  iwait = 1   ! wait for completion

  nargs = command_argument_count()
  if (nargs >= 1) then
    call getarg(1, arg)
    read(arg,'(i)') N
  endif
  if (nargs >= 2) then
    call getarg(2, arg)
    read(arg,'(i)') iwait
  endif

  allocate(a(N), b(N))

  !--------------------------------------------
  ! test 1: compute loop inside the kernel
  !$acc data region local(a,b)
  ! initialization
  a = 0.01
  !$acc region
  do i = 1, N
    b(i) = 0.1
  end do
  !$acc end region
  !$acc update device(a)
#ifdef _MCUDA_
  istat = cudaThreadSynchronize()
#endif
  CALL SYSTEM_CLOCK(COUNT=it1, COUNT_RATE=icountrate, COUNT_MAX=icountmax)
  ! iteration loop to gather some statistics
  do itime = 1, nt
#ifdef _MCUDA_
    IF (iwait == 1) istat = cudaThreadSynchronize()
#endif
    !$acc region do kernel parallel, vector(256)
    do i = 1, N
      do k = 1, nc  ! compute loop
        a(i) = a(i)*0.01 + exp(b(i)*b(i))
      end do
    end do
    !$acc end region
  end do
#ifdef _MCUDA_
  istat = cudaThreadSynchronize()
#endif
  CALL SYSTEM_CLOCK(COUNT=it2)
  !$acc update host(a)
  !$acc end data region
  rt1 = (REAL(it2) - REAL(it1)) / REAL(icountrate)
  write(*,"(A7,I,A7,I,A7,I)") ' N=', N, ' , nt=', nt, ' , nc=', nc
  print*, '1: sum(a)=', sum(a)
  write(*,"(A,F10.2)") ' 1: time per step (us) =', rt1/nt * 1E6

  !--------------------------------------------
  ! test 2: compute loop outside the kernel
  a = 0.01
  !$acc data region local(a,b)
  !$acc update device(a)
#ifdef _MCUDA_
  istat = cudaThreadSynchronize()
#endif
  CALL SYSTEM_CLOCK(COUNT=it1, COUNT_RATE=icountrate, COUNT_MAX=icountmax)
  ! iteration loop to gather some statistics
  do itime = 1, nt
    do k = 1, nc  ! compute loop
#ifdef _MCUDA_
      IF (iwait == 1) istat = cudaThreadSynchronize()
#endif
      !$acc region do kernel parallel, vector(256)
      do i = 1, N
        a(i) = a(i)*0.01 + exp(b(i)*b(i))
      end do
      !$acc end region
    end do
  end do
#ifdef _MCUDA_
  istat = cudaThreadSynchronize()
#endif
  CALL SYSTEM_CLOCK(COUNT=it2)
  !$acc update host(a)
  !$acc end data region
  rt2 = (REAL(it2) - REAL(it1)) / REAL(icountrate)
  ! print timings
  print*, '2: sum(a)=', sum(a)
  write(*,"(A,F10.2)") ' 2: time per step (us) =', rt2/nt * 1E6
  write(*,"(A,F10.2)") ' Mean kernel overhead per launch (us)=', &
       abs(rt1-rt2)/(real(nt*nc))*1E6
end program main
which I compile with the following command:
pgf90 -ta=nvidia -O3 -Minfo=accel -Mcuda -D_MCUDA_ -Mpreprocess -o kernel_overhead_timing kernel_overhead_timing.f90
If I now run it with 10000 parallel threads, I get an overhead of about 30 us per launch:
./kernel_overhead_timing 10000 1
N= 10000 , nt= 1000 , nc= 10
1: sum(a)= 10202.52694098092
1: time per step (us) = 53.76
2: sum(a)= 10202.52694098092
2: time per step (us) = 334.37
Mean kernel overhead per launch (us)= 28.06
Note that in this first experiment, since the last argument is set to 1, cudaThreadSynchronize() is called before each kernel launch.
If I now run with the last argument set to 0:
./kernel_overhead_timing 10000 0
N= 10000 , nt= 1000 , nc= 10
1: sum(a)= 10202.52694098092
1: time per step (us) = 26.44
2: sum(a)= 10202.52694098092
2: time per step (us) = 61.95
Mean kernel overhead per launch (us)= 3.55
I get 3.55 us, which seems to indicate that the successive kernel executions have been overlapping (the launches return immediately and queue up while earlier kernels are still running).
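As a quick sanity check on the reported numbers (an editor's sketch in Python, using only the "time per step" values printed by the two runs above): since rt = t_per_step * nt * 1e-6, the program's formula abs(rt1-rt2)/(nt*nc)*1e6 reduces to abs(t1-t2)/nc.

```python
nt, nc = 1000, 10  # timing iterations, compute iterations

# measured "time per step (us)" for test 1 and test 2, from the runs above
runs = {"iwait=1": (53.76, 334.37), "iwait=0": (26.44, 61.95)}

for label, (t1, t2) in runs.items():
    # same formula as in the Fortran code, simplified to abs(t1 - t2) / nc
    overhead = abs(t1 - t2) / nc
    print(label, round(overhead, 2))  # 28.06 and 3.55, matching the output
```

This reproduces the 28.06 us and 3.55 us figures exactly, so the difference between the two runs comes entirely from the extra synchronization.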
For code using accelerator directives, are there cases where an equivalent of cudaThreadSynchronize() is issued implicitly? If so, one should expect roughly 30 us of additional time per launch when using multiple kernels.
Thanks,
Xavier