kernel launch overhead

Hi,

I am trying to better understand the time overhead required to launch kernels.
I have written the following test code:

program main
#ifdef _MCUDA_
USE cudafor
#endif
  implicit none
  integer*4 :: N,nc,i,k,itime,nt,nargs,ierr,iwait
  real*8, allocatable :: a(:), b(:)
  character*16 arg
  integer*4 :: dt1(8), dt2(8), it1, it2,istat, &
       idummy,icountrate,icountmax
  real*8 :: rt1,rt2

  N=1E4        !number of parallel threads
  nt=1000      !number of timing iterations
  nc=10        !number of compute iterations
  iwait=1      !wait for completion

  nargs = command_argument_count()
    if( nargs >= 1 ) then
       call getarg( 1, arg )
       read(arg,'(i)') N
    endif

    if( nargs >= 2 ) then
       call getarg( 2, arg )
       read(arg,'(i)') iwait
    endif

 allocate(a(N),b(N))

!--------------------------------------------
!test1: compute loop inside kernel
!$acc data region local(a,b)
!init
a=0.01
!$acc region
do i=1,N
b(i)=0.1
end do
!$acc end region

!$acc update device(a)
#ifdef _MCUDA_
 istat=cudaThreadSynchronize()
#endif
CALL SYSTEM_CLOCK(COUNT=it1,COUNT_RATE=icountrate,COUNT_MAX=icountmax)

!iteration loop to have some statistic
do itime=1,nt
   !initialization
#ifdef _MCUDA_
   IF (iwait==1) istat=cudaThreadSynchronize()
#endif
   !$acc region do kernel parallel, vector(256)
   do i=1,N
      do k=1,nc  !compute loop
         a(i)=a(i)*0.01+exp(b(i)*b(i))
      end do
   end do
   !$acc end region
end do


#ifdef _MCUDA_
 istat=cudaThreadSynchronize()
#endif
CALL SYSTEM_CLOCK(COUNT=it2)
!$acc update host(a)
!$acc end data region



 rt1 = ( REAL(it2) - REAL(it1) ) / REAL(icountrate)

write(*,"(A7,I,A7,I,A7,I)") ' N=', N, ' , nt=',nt,' , nc=',nc
print*, '1: sum(a)=',sum(a)
write(*,"(A,F10.2)")  ' 1: time per step (us) =', rt1/nt * 1E6

!--------------------------------------------
!test2: compute loop outside kernel
a=0.01

!$acc data region local(a,b)
!$acc update device(a)
#ifdef _MCUDA_
 istat=cudaThreadSynchronize()
#endif
CALL SYSTEM_CLOCK(COUNT=it1,COUNT_RATE=icountrate,COUNT_MAX=icountmax)
!iteration loop to have some statistic
do itime=1,nt
   do k=1,nc  !compute loop
#ifdef _MCUDA_
      IF (iwait==1) istat=cudaThreadSynchronize()
#endif
   !$acc region do kernel parallel, vector(256)
   do i=1,N
      a(i)=a(i)*0.01+exp(b(i)*b(i))
   end do
   !$acc end region
   end do
end do
#ifdef _MCUDA_
 istat=cudaThreadSynchronize()
#endif
CALL SYSTEM_CLOCK(COUNT=it2)
!$acc update host(a)
!$acc end data region

rt2 = ( REAL(it2) - REAL(it1) ) / REAL(icountrate)
!print time
print*, '2: sum(a)=',sum(a)
write(*,"(A,F10.2)")  ' 2: time per step (us) =', rt2/nt * 1E6
write(*,"(A,F10.2)")  ' Mean kernel overhead per launch (us)=', abs(rt1-rt2)/(real(nt*nc))*1E6

end program main

which I compile with the following command:
pgf90 -ta=nvidia -O3 -Minfo=accel -Mcuda -D_MCUDA_ -Mpreprocess -o kernel_overhead_timing kernel_overhead_timing.f90

If I now run it with 10000 parallel threads, I get an overhead of about 30 us:
./kernel_overhead_timing 10000 1
N= 10000 , nt= 1000 , nc= 10
1: sum(a)= 10202.52694098092
1: time per step (us) = 53.76
2: sum(a)= 10202.52694098092
2: time per step (us) = 334.37
Mean kernel overhead per launch (us)= 28.06

Note that for this first experiment, with the last argument set to 1, there is a call to cudaThreadSynchronize() before each kernel launch.

If I now run with option 0:
./kernel_overhead_timing 10000 0
N= 10000 , nt= 1000 , nc= 10
1: sum(a)= 10202.52694098092
1: time per step (us) = 26.44
2: sum(a)= 10202.52694098092
2: time per step (us) = 61.95
Mean kernel overhead per launch (us)= 3.55

I get 3.55 us, which seems to indicate that the different kernel executions have been overlapping.

For a code using directives, would there be cases where an equivalent of cudaThreadSynchronize() is issued? In that case one should count on an additional 30 us when using multiple kernels.

Thanks,

Xavier

Hi Xavier,

Host code will block until the kernel is done executing so there is no need to add a call to “cudaThreadSynchronize”. I think the difference you’re seeing between the two runs is the overhead of calling this function.
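If you want to measure that directly, here is a minimal sketch (my own example, assuming CUDA Fortran's cudafor module; the iteration count is arbitrary) that times repeated calls to cudaThreadSynchronize() on an otherwise idle device:

program sync_cost
  use cudafor
  implicit none
  integer, parameter :: nt = 10000      ! number of timed calls
  integer :: istat, it1, it2, icountrate, i

  ! first call forces CUDA context creation so it is not included in the timing
  istat = cudaThreadSynchronize()

  call system_clock(count=it1, count_rate=icountrate)
  do i = 1, nt
     ! nothing is pending on the device, so this measures only the call overhead
     istat = cudaThreadSynchronize()
  end do
  call system_clock(count=it2)

  print *, 'us per cudaThreadSynchronize call =', &
       (real(it2) - real(it1)) / real(icountrate) / real(nt) * 1e6
end program sync_cost

Compile it with -Mcuda; since nothing is pending on the device, the printed number is just the cost of the call itself.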

I ran your program and set “CUDA_PROFILE” to 1 to get a device profile. Using a perl script I wrote to sum the values, I show very little difference in the GPU time between the two runs.

% pgf90 -ta=nvidia -O3  -Mcuda -D_MCUDA_ -Mpreprocess test.f90 -V11.10
% a.out 10000 1
     N=       10000  , nt=        1000  , nc=          10
 1: sum(a)=    10202.52694098092     
 1: time per step (us) =     45.08
 2: sum(a)=    10202.52694098092     
 2: time per step (us) =    201.89
 Mean kernel overhead per launch (us)=     15.68
% perl totalProf.pl cuda_profile_0.log
                             main_37_gpu   0.000007   0.000020
                              memcpyDtoH   0.000029   0.000083
                              memcpyHtoD   0.000031   0.000048
                             main_55_gpu   0.026302   0.004748
                             main_96_gpu   0.060909   0.037238
                                 Totals:   0.087277   0.042137

% a.out 10000 0
     N=       10000  , nt=        1000  , nc=          10
 1: sum(a)=    10202.52694098092     
 1: time per step (us) =     28.91
 2: sum(a)=    10202.52694098092     
 2: time per step (us) =     81.23
 Mean kernel overhead per launch (us)=      5.23
% perl totalProf.pl cuda_profile_0.log
                             main_37_gpu   0.000007   0.000021
                              memcpyDtoH   0.000029   0.000086
                              memcpyHtoD   0.000031   0.000046
                             main_55_gpu   0.026188   0.004331
                             main_96_gpu   0.059881   0.034520
                                 Totals:   0.086135   0.039004

The main overhead in launching a kernel has to do with copying the kernel code itself to the GPU as well as copying the arguments. Hence, the overhead can vary quite a bit. One thing you can look at is the profile output from the “-ta=nvidia,time” flag. It will give you initialization times.

% pgf90 -ta=nvidia,time -O3 -Mcuda -D_MCUDA_ -Mpreprocess test.f90 -V11.10
% a.out 10000 0
     N=       10000  , nt=        1000  , nc=          10
 1: sum(a)=    10202.52694098092     
 1: time per step (us) =     48.39
 2: sum(a)=    10202.52694098092     
 2: time per step (us) =    276.12
 Mean kernel overhead per launch (us)=     22.77

Accelerator Kernel Timing data
/home/colgrove/support/xlapillonne/12_07_11/test.f90
  main
    106: region entered 1 time
        time(us): init=0
                  data=55
/home/colgrove/support/xlapillonne/12_07_11/test.f90
  main
    95: region entered 10000 times
        time(us): total=274319 init=817 region=273502
                  kernels=145376 data=0
        w/o init: total=273502 max=61 min=26 avg=27
        96: kernel launched 10000 times
            grid: [40]  block: [256]
            time(us): total=145376 max=48 min=13 avg=14
/home/colgrove/support/xlapillonne/12_07_11/test.f90
  main
    84: region entered 1 time
        time(us): init=0
                  data=48
/home/colgrove/support/xlapillonne/12_07_11/test.f90
  main
    83: region entered 1 time
        time(us): total=276442 init=1 region=276441
        w/o init: total=276441 max=276441 min=276441 avg=276441
/home/colgrove/support/xlapillonne/12_07_11/test.f90
  main
    68: region entered 1 time
        time(us): init=0
                  data=52
/home/colgrove/support/xlapillonne/12_07_11/test.f90
  main
    54: region entered 1000 times
        time(us): total=48202 init=76 region=48126
                  kernels=34788 data=0
        w/o init: total=48126 max=63 min=46 avg=48
        55: kernel launched 1000 times
            grid: [40]  block: [256]
            time(us): total=34788 max=45 min=34 avg=34
/home/colgrove/support/xlapillonne/12_07_11/test.f90
  main
    42: region entered 1 time
        time(us): init=0
                  data=57
/home/colgrove/support/xlapillonne/12_07_11/test.f90
  main
    36: region entered 1 time
        time(us): total=80 init=1 region=79
                  kernels=38 data=0
        w/o init: total=79 max=79 min=79 avg=79
        37: kernel launched 1 times
            grid: [40]  block: [256]
            time(us): total=38 max=38 min=38 avg=38
/home/colgrove/support/xlapillonne/12_07_11/test.f90
  main
    33: region entered 1 time
        time(us): total=170986 init=121122 region=49864
        w/o init: total=49864 max=49864 min=49864 avg=49864

Hope this helps,
Mat

Hi Mat,

Thanks for clarifying things.

In fact I am trying to understand why it is faster in this case to call one larger kernel as compared to many smaller kernels. So, looking only at the case

% a.out 10000 1
...
1: time per step (us) =     45.0
...
2: time per step (us) =    201.89

I thought at first that the difference between the two timings was coming from the initialization time, but the -ta=nvidia,time info seems to indicate that it is in fact very small.

    95: region entered 10000 times 
        time(us): total=274319 init=817 region=273502 
                  kernels=145376 data=0

which would mean 817/10000 us, i.e. about 0.08 us per launch. This is way under the 3.5 us I was looking at.

How should one interpret the difference between region=273502 and kernels=145376? What is the code doing during this time?

Xavier

How should one interpret the difference between region=273502 and kernels=145376? What is the code doing during this time?

It’s due to the fact that the profile information is being gathered from the host side. Hence the region timing can include CPU execution time, and much of this time can be attributed to setting up the call to the device. Here that difference is (273502 - 145376)/10000 us, i.e. roughly 13 us of host-side time per launch. For this code, calling the kernel fewer times does seem to be better.

In fact I am trying to understand why it is faster in this case to call one larger kernel as compared to many smaller kernels.

In this case, the kernel is the same size (which is actually very small). You’re just giving it a different amount of work.

Typically, the large-versus-small-kernel question has to do with the amount of resources (registers, constant memory, and shared memory) a kernel consumes. Since there is a finite amount of these resources, the more a single kernel uses per thread, the fewer threads (warps) can be resident on the GPU at the same time. This is called "occupancy". (NVIDIA has a nice Excel spreadsheet which helps determine occupancy: http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls) Generally, a program with low occupancy will have lower performance, but a high-occupancy kernel may or may not have higher performance.
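As a rough illustration (the per-SM limits here are my own numbers for a CC 2.0 part and are not reported by the compiler): take the line-96 kernel from the feedback below, which uses 13 registers with vector(256) blocks. Ignoring register-allocation granularity, the calculator's arithmetic works out to roughly

\[
\text{blocks/SM} = \min\!\left( 8,\ \left\lfloor \frac{32768}{13 \times 256} \right\rfloor,\ \left\lfloor \frac{48}{256/32} \right\rfloor \right) = \min(8,\ 9,\ 6) = 6,
\qquad
\text{occupancy} = \frac{6 \times 8 \text{ warps}}{48 \text{ warps}} = 100\%,
\]

which matches the 100% occupancy the compiler reports for that kernel.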

  • Mat

Occupancy information can be seen in the compiler feedback messages:

% pgf90 -ta=nvidia -O3 -Mpreprocess test.f90 -V11.10 -Minfo=accel
main:
     33, Generating local(b(:))
         Generating local(a(:))
     36, Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     37, Loop is parallelizable
         Accelerator kernel generated
         37, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 1.3 : 6 registers; 44 shared, 4 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 4 shared, 56 constant, 0 local memory bytes; 100% occupancy
     42, Generating !$acc update device(a(:))
     54, Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     55, Loop is parallelizable
         Accelerator kernel generated
         55, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 1.3 : 14 registers; 60 shared, 28 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 18 registers; 4 shared, 84 constant, 0 local memory bytes; 100% occupancy
     56, Complex loop carried dependence of 'a' prevents parallelization
         Loop carried dependence of 'a' prevents parallelization
         Loop carried backward dependence of 'a' prevents vectorization
     68, Generating !$acc update host(a(:))
     83, Generating local(b(:))
         Generating local(a(:))
     84, Generating !$acc update device(a(:))
     95, Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     96, Loop is parallelizable
         Accelerator kernel generated
         96, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 1.3 : 10 registers; 60 shared, 28 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 13 registers; 4 shared, 84 constant, 0 local memory bytes; 100% occupancy
    106, Generating !$acc update host(a(:))

Hi Mat,
In the PVF 2014 project settings, where can I activate the "-ta=nvidia,time" flag?

Actually, what I'm interested in is how, in PVF, I can get the overhead time of a kernel launch. When I turn on "accelerator profiling" in PVF, I only get something like the output below, which does not tell me the kernel launch overhead.

275: compute region reached 1347 times
276: kernel launched 1347 times
grid: [1] block: [128]
device time(us): total=2,698 max=4 min=2 avg=2
elapsed time(us): total=219,000 max=16,000 min=0 avg=162


Thanks,
JMa

Hi JMa,

You'd need to set the flag under the Command Line tab since it doesn't have an explicit setting in the property options. Though, the preferred method is to set the environment variable PGI_ACC_TIME=1, since this allows you to enable/disable profiling without having to relink.

The difference between the elapsed time and the device time is essentially the kernel launch overhead. Elapsed time is measured from the host, so it includes copying arguments, setting up the launch, launching the kernel, etc. Device time is measured via a CUDA event, so it just measures kernel execution time.
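If you want to see the two measurements side by side outside of the profiler, here is a minimal CUDA Fortran sketch (my own illustration, not taken from your project; the kernel, module name, and sizes are arbitrary) that brackets a single launch with both a host clock and CUDA events:

module sketch_kernels
  use cudafor
contains
  attributes(global) subroutine scale(a, n)
    real(8), device :: a(*)
    integer, value :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = a(i) * 2.0d0
  end subroutine scale
end module sketch_kernels

program elapsed_vs_device
  use cudafor
  use sketch_kernels
  implicit none
  integer, parameter :: n = 10000
  real(8) :: a(n)
  real(8), device :: a_d(n)
  type(cudaEvent) :: ev_start, ev_stop
  integer :: istat, it1, it2, icountrate
  real(4) :: dev_ms

  a = 0.01d0
  a_d = a                                   ! host-to-device copy
  istat = cudaEventCreate(ev_start)
  istat = cudaEventCreate(ev_stop)

  ! host timer: comparable to the profiler's "elapsed time"
  call system_clock(count=it1, count_rate=icountrate)
  istat = cudaEventRecord(ev_start, 0)
  call scale<<<(n+255)/256, 256>>>(a_d, n)  ! launch on the default stream
  istat = cudaEventRecord(ev_stop, 0)
  istat = cudaEventSynchronize(ev_stop)
  call system_clock(count=it2)

  ! event timer: comparable to the profiler's "device time"
  istat = cudaEventElapsedTime(dev_ms, ev_start, ev_stop)
  print *, 'host  (elapsed) us:', (real(it2) - real(it1)) / real(icountrate) * 1e6
  print *, 'event (device)  us:', dev_ms * 1e3
end program elapsed_vs_device

Build it as CUDA Fortran (a .cuf source in PVF, or pgf90 -Mcuda on the command line). The gap between the two printed numbers is roughly the host-side overhead paid on top of the device execution time.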

Hope this helps,
Mat

Hi Mat,
Many thanks for the help.
I tried to put this flag at: Project Properties -> Linker -> Command Line.
However, I still don't see the info about launch overhead:

device time(us): total=2 max=2 min=2 avg=2
82: compute region reached 1 time
84: kernel launched 1 time
grid: [2500] block: [128]
device time(us): total=220 max=220 min=220 avg=220

By the way, I also noticed that the wall-clock time (measured between just before and just after the !$acc kernels loop) is much larger than even the elapsed time. Do you know what this time might be spent on?

Lastly, do you know if there is a way to reduce the kernel launch overhead?

Thanks,
JMA

Hi JMa,

However, I still don't see the info about launch overhead:

Hmm. The host timer measures the time for the entire kernel region, so if you have multiple kernel launches, you'll have several "device" times and one elapsed time. It's possible that's what's happening here.

By the way, I also noticed that the wall-clock time (measured between just before and just after the !$acc kernels loop) is much larger than even the elapsed time. Do you know what this time might be spent on?

By clock time, you mean you've added your own timer? Is this the first kernel launch? There's a start-up cost with the GPU which is roughly 1-2 seconds per attached device. You may be getting this start-up cost in your clock timer. Try calling "acc_init" before your timer to see if it goes away.

Lastly, do you know if there is a way to reduce the kernel launch overhead?

You can eliminate the start-up cost by running the "pgcudainit" utility as a background process. This will hold the devices open instead of allowing them to power down. Though, this is only worthwhile for short runs, given that the cost is amortized over longer runs.

  • Mat

Hi Mat,
Many thanks.
1. I re-tried with a small sample (pasted below) which has only one loop. I still don't see the elapsed time info.

call acc_init

print *
print *, 'GPU Comp. Start @ hh/mm/ss: ', time
print *

start = omp_get_wtime()

!$acc kernels loop
do i = 1, n
   do j = 1, m
      if (j.eq.1) r(i) = a(i) * 2.0
      r(i) = a(i) * 2.0
   enddo
enddo

call date_and_time(TIME=time)

print *
print *, 'GPU Comp. end @ hh/mm/ss : ', time
print *

finish = omp_get_wtime()
time_gpu = finish - start

print *
print *, 'Wall Time = ', time_gpu

2. Thank you, yes, acc_init works to exclude the start-up time.
In PVF, how can I start the "pgcudainit" utility? Also, will it actually work the same as "call acc_init"?

3. When I ran the above small sample, I got the output below. As seen, the device time is only about 290 us, while the wall time excluding the start-up cost is much larger: Wall Time (s) = 1.7535036E-02.
Is all of this difference due to kernel launch overhead? How can I have PVF directly output this part of the time? And, more importantly, how can I minimize this overhead?

Thanks,
JMa

Outer Loop Num.: 1000000
Inner Loop Num.: 1000000000

This is to Compare GPU and CPU computation time

GPU Init. Start @ hh/mm/ss: 154405.238


GPU Comp. Start @ hh/mm/ss: 154405.238


GPU Comp. end @ hh/mm/ss : 154405.487

Wall Time = 1.7535036E-02

Accelerator Kernel Timing data
B:\Projects\DF-GPU\Test_PGI\1stOpenACCProgram.cuf
main NVIDIA devicenum=0
time(us): 294
59: data region reached 1 time
59: data copyin transfers: 1
device time(us): total=2 max=2 min=2 avg=2
67: data copyout transfers: 1
device time(us): total=2 max=2 min=2 avg=2
59: compute region reached 1 time
61: kernel launched 1 time
grid: [7813] block: [128]
device time(us): total=290 max=290 min=290 avg=290