Kernel call overhead: launch overhead looks large for a small number of calls

I'm trying to gauge the overhead associated with a kernel call. Unless I call the kernel a huge number of times in a loop, the launch overhead is actually really bad on my machine (unless I'm overlooking something). I'd like to know if anyone has tried a similar experiment and what their conclusions were.

I'm reporting results obtained on XP, with DevStudio2005 and CUDA 1.1. I ran this on a Dell with a quad-core 2.8 GHz Xeon and 3 GB of RAM.

I used a grid of 1000 by 1000 blocks, each with 256 by 256 threads. In fact, the timing results seem to be insensitive to the execution configuration.

Here is what I got. I list the number of times I run the loop that invokes the empty kernel, and I report three sets of results per configuration to get an idea of the average:

Loop with 1 kernel call:
Processing time: 79.262657 (ms)
Processing time: 77.801155 (ms)
Processing time: 77.531998 (ms)

Loop with 10 kernel calls:
Processing time: 76.896141 (ms)
Processing time: 77.876907 (ms)
Processing time: 77.362549 (ms)

Loop with 100 kernel calls:
Processing time: 79.482323 (ms)
Processing time: 79.455841 (ms)
Processing time: 80.660873 (ms)

Loop with 1000 kernel calls:
Processing time: 86.425369 (ms)
Processing time: 85.208755 (ms)
Processing time: 84.457947 (ms)

Loop with 10000 kernel calls:
Processing time: 152.974518 (ms)
Processing time: 157.944000 (ms)
Processing time: 152.800629 (ms)

Loop with 100000 kernel calls:
Processing time: 836.776672 (ms)
Processing time: 837.642090 (ms)
Processing time: 827.006958 (ms)

Loop with 1000000 kernel calls:
Processing time: 7606.758789 (ms)
Processing time: 7628.625977 (ms)
Processing time: 7610.461426 (ms)

Loop with 2000000 kernel calls:
Processing time: 16004.808594 (ms)
Processing time: 15134.964844 (ms)
Processing time: 15147.358398 (ms)

I used the code below:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cutil.h>

__global__ void empty() {}

/************************************************************************/
/* Timing overhead of launching a kernel                                */
/************************************************************************/
int main(int argc, char* argv[])
{
    int num_blocks  = atoi(argv[1]);
    int num_threads = atoi(argv[2]);
    int num_times   = atoi(argv[3]);

    // set up timer
    unsigned int timer = 0;
    CUT_SAFE_CALL( cutCreateTimer( &timer));
    CUT_SAFE_CALL( cutStartTimer( timer));

    // set up execution parameters
    dim3 gridStruct (num_blocks, num_blocks, 1);
    dim3 blockStruct(num_threads, num_threads, 1);

    for (int i = 0; i < num_times; i++)
    {
        empty<<<gridStruct, blockStruct>>>();
    }

    CUT_SAFE_CALL( cutStopTimer( timer));
    printf("Processing time: %f (ms)\n", cutGetTimerValue( timer));
    CUT_SAFE_CALL( cutDeleteTimer( timer));

    return 0;
}

btw, I used the following linker command line:

/OUT:"…\bin\win32\Release\kernelCallOverhead.exe" /INCREMENTAL:NO /NOLOGO /LIBPATH:“C:\CUDA\lib” /LIBPATH:“C:\Progra~1\NVIDIA~1\NVIDIA~1\common\lib” /MANIFEST /MANIFESTFILE:“Release\kernelCallOverhead.exe.intermediate.manifest” /DEBUG /PDB:“c:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\bin\win32\Release\kernelCallOverhead.pdb” /SUBSYSTEM:CONSOLE /OPT:REF /OPT:ICF /LTCG /MACHINE:X86 /ERRORREPORT:PROMPT cudart.lib cutil32.lib kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib

Maybe I'm missing something here, but a 256x256 thread block won't even launch, since the block size is 512 threads max.

Not to mention that if you don't copy the results back to host memory at some point, the compiler may just optimize everything away. So I'm thinking maybe you're just seeing the time it takes to go through a loop that does nothing.

If you really want to see ONLY the kernel call overhead, then cut and paste the empty<<<>>> line a whole bunch of times instead of using a loop, since a loop does take some time to execute.
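For example, a hypothetical helper macro could expand ten launches inline, just to avoid pasting the line by hand (the macro name is illustrative):

#define LAUNCH_10(grid, block)                          \
    empty<<<grid, block>>>(); empty<<<grid, block>>>(); \
    empty<<<grid, block>>>(); empty<<<grid, block>>>(); \
    empty<<<grid, block>>>(); empty<<<grid, block>>>(); \
    empty<<<grid, block>>>(); empty<<<grid, block>>>(); \
    empty<<<grid, block>>>(); empty<<<grid, block>>>();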

Agree with Ailleur. The kernel most probably doesn't launch. Thing is, you won't really notice it unless you use proper error-handling code.

try to add the line

CUT_CHECK_ERROR("kernel failure!");

right after your kernel call, and run your program again in debug mode. I bet you’ll see an appropriate error message then…
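(For reference, the same check is possible without cutil, straight from the runtime API; a minimal sketch:)

empty<<<gridStruct, blockStruct>>>();
cudaError_t err = cudaGetLastError();   // picks up launch errors such as an invalid configuration
if (err != cudaSuccess)
    printf("kernel failure: %s\n", cudaGetErrorString(err));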

- michael

Ailleur,

It should be 16x16 (= 256) threads per block and 100x100 (= 10,000) blocks; my mistake…

As for the overhead of the for loop: I got rid of the loop and used a long copy-and-pasted sequence of calls to "empty<<<>>>" instead.

Calling 100 times:
Processing time: 79.449738 (ms)
Processing time: 90.098373 (ms)
Processing time: 81.629860 (ms)

Calling 1000 times:
Processing time: 97.878975 (ms)
Processing time: 85.997063 (ms)
Processing time: 87.719788 (ms)

I tried to get the 10,000-call case to build, but the compiler ran for about 20 minutes without finishing…

At any rate, these results are basically identical to the ones I reported in my original post.

To conclude, I can't quite sort things out here…

btw, I timed the cost of the bare "for loop" that launches the kernel multiple times. I did this in debug mode (in release the empty for-loop gets optimized out), with a harness roughly like the sketch below, and here is what I got in terms of the "for-loop" overhead:
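(Roughly, the harness looked like this; the loop body is empty, so only the loop mechanics get timed:)

unsigned int timer = 0;
CUT_SAFE_CALL( cutCreateTimer( &timer));
CUT_SAFE_CALL( cutStartTimer( timer));
for (int i = 0; i < num_times; i++)
{
    // empty body: no kernel launch, just the loop itself
}
CUT_SAFE_CALL( cutStopTimer( timer));
printf("Processing time: %f (ms)\n", cutGetTimerValue( timer));
CUT_SAFE_CALL( cutDeleteTimer( timer));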

1 Iteration:
Processing time: 0.001529 (ms)
Processing time: 0.001466 (ms)
Processing time: 0.001504 (ms)

10 Iterations:
Processing time: 0.001549 (ms)
Processing time: 0.001536 (ms)
Processing time: 0.001544 (ms)

100 Iterations:
Processing time: 0.001794 (ms)
Processing time: 0.001802 (ms)
Processing time: 0.001794 (ms)

1000 Iterations:
Processing time: 0.004043 (ms)
Processing time: 0.004048 (ms)
Processing time: 0.004038 (ms)

10000 Iterations:
Processing time: 0.026609 (ms)
Processing time: 0.026599 (ms)
Processing time: 0.026591 (ms)

100000 Iterations:
Processing time: 0.252169 (ms)
Processing time: 0.252164 (ms)
Processing time: 0.253177 (ms)

1000000 Iterations:
Processing time: 2.530625 (ms)
Processing time: 2.565116 (ms)
Processing time: 2.525184 (ms)

2000000 Iterations:
Processing time: 5.069882 (ms)
Processing time: 5.074800 (ms)
Processing time: 5.040833 (ms)

This is in debug mode, so I imagine in release the overhead would be even smaller. Which goes to show that what I'm seeing is a very large overhead associated with the kernel launch itself.

Again, this seems to matter only when I have a small number of kernel calls. But that is exactly the case I end up operating in; I don't have 1,000,000 successive kernel calls…

Any thoughts?

thank you.

Try adding one thing to your test: have a kernel call outside the loop and outside the timer. This should take care of all the CUDA initialisation, warming up, and whatnot.

Ailleur - things fell into place, thanks for the suggestion.

I see, pretty consistently, an 8 µs overhead associated with the kernel call, provided I have a "warm-up" call before I start doing the actual timing.

My next question would be this: if I call a kernel, then do some work on the host, and after a while call a kernel again, will I or will I not see the same large overhead (about 70 ms) that I noticed in the timings without the "warm-up" kernel call? In other words, can the kernel launch mechanism go "cold" quickly or not? I'd like to know the answer here, but I don't think I'd know how to go about it off the top of my head.
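One possible way to go about it (a rough sketch along the lines of the code in this thread; Sleep() from windows.h stands in for the host-side work, and the 5-second pause is an arbitrary choice):

#include <stdio.h>
#include <windows.h>        // Sleep()
#include <cuda_runtime.h>
#include <cutil.h>

__global__ void empty() {}

int main()
{
    dim3 grid (100, 100, 1);
    dim3 block(16, 16, 1);

    // warm-up: absorbs the one-time initialization cost
    empty<<<grid, block>>>();
    cudaThreadSynchronize();

    unsigned int timer = 0;
    CUT_SAFE_CALL( cutCreateTimer( &timer));

    // time one launch right after the warm-up
    CUT_SAFE_CALL( cutStartTimer( timer));
    empty<<<grid, block>>>();
    cudaThreadSynchronize();
    CUT_SAFE_CALL( cutStopTimer( timer));
    printf("right after warm-up: %f (ms)\n", cutGetTimerValue( timer));

    // let the host be busy (or idle) for a while
    Sleep(5000);

    // time one launch again; if the launch path went "cold", this should jump back up
    CUT_SAFE_CALL( cutResetTimer( timer));
    CUT_SAFE_CALL( cutStartTimer( timer));
    empty<<<grid, block>>>();
    cudaThreadSynchronize();
    CUT_SAFE_CALL( cutStopTimer( timer));
    printf("after the pause: %f (ms)\n", cutGetTimerValue( timer));

    CUT_SAFE_CALL( cutDeleteTimer( timer));
    return 0;
}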

At any rate, here is the code I'm using, which gives meaningful and expected results for the kernel launch overhead:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cutil.h>

__global__ void empty() {}

/************************************************************************/
/* Timing overhead of launching a kernel                                */
/************************************************************************/
int main(int argc, char* argv[])
{
    int num_blocks  = atoi(argv[1]);
    int num_threads = atoi(argv[2]);
    int num_times   = atoi(argv[3]);

    // set up execution parameters
    dim3 gridStruct (num_blocks, num_blocks, 1);
    dim3 blockStruct(num_threads, num_threads, 1);

    // warm-up call...
    empty<<<gridStruct, blockStruct>>>();

    // set up timer
    unsigned int timer = 0;
    CUT_SAFE_CALL( cutCreateTimer( &timer));
    CUT_SAFE_CALL( cutStartTimer( timer));

    for (int i = 0; i < num_times; i++)
    {
        empty<<<gridStruct, blockStruct>>>();
        CUT_CHECK_ERROR("kernel failure!");
    }

    CUT_SAFE_CALL( cutStopTimer( timer));
    printf("Processing time: %f (ms)\n", cutGetTimerValue( timer));
    CUT_SAFE_CALL( cutDeleteTimer( timer));

    return 0;
}

Here are the results I'm seeing:

Calls in the for-loop: 1
Processing time: 0.010739 (ms)
Processing time: 0.010180 (ms)
Processing time: 0.010667 (ms)

Calls in the for-loop: 10
Processing time: 0.081580 (ms)
Processing time: 0.080056 (ms)
Processing time: 0.082665 (ms)

Calls in the for-loop: 100
Processing time: 1.101638 (ms)
Processing time: 0.795309 (ms)
Processing time: 0.789608 (ms)

Calls in the for-loop: 1000
Processing time: 7.933981 (ms)
Processing time: 7.659184 (ms)
Processing time: 7.798267 (ms)

Calls in the for-loop: 10000
Processing time: 78.418259 (ms)
Processing time: 77.147629 (ms)
Processing time: 76.363297 (ms)

Calls in the for-loop: 100000
Processing time: 777.456543 (ms)
Processing time: 798.018494 (ms)
Processing time: 799.255310 (ms)

Calls in the for-loop: 1000000
Processing time: 7871.641602 (ms)
Processing time: 7867.771484 (ms)
Processing time: 7856.774414 (ms)

Calls in the for-loop: 2000000
Processing time: 15986.023438 (ms)
Processing time: 15708.211914 (ms)
Processing time: 15796.656250 (ms)

I don't have the answer to that one, and I'll be interested if someone has it!

I would guess that the CUDA context lives as long as the thread that created it, but that's just speculation.

Just curious… you are saying your warm-up time is about 70-80 ms? How come I did not observe that time (my kernel steadily executes within 5 ms)?

  • Do you know when the GPU init starts? (Is it at the first kernel call, at the call of CUT_DEVICE_INIT(), or whatever?)
  • Is this warm-up time somehow constant (i.e., the same no matter which kernel)? If so, would that mean my measurements are somehow messed up?

thx,
michael

If you don't do a device init, the init takes place on the first CUDA call.
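(A common idiom, if you want to pay that initialization cost at a point of your own choosing, is to issue any cheap runtime call up front, for example:)

cudaFree(0);   // no-op free; forces context creation before the first real kernel launch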

I realise this is a very old topic, but I did some experimentation on this using Fermi (GTX 480) and CUDA 3.1 and decided to share my findings.

The 1st kernel call overhead is 0.006848 ms.
Every call after this takes 0.002496 ms.
So the total for x calls = 0.006848 + x * 0.002496 milliseconds.
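For example, 1000 back-to-back calls would come to 0.006848 + 1000 * 0.002496, which is about 2.50 ms.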

hopefully someone will find this useful
cheers :)


What OS did you measure this on?


32-bit Windows 7


I am running a small-scale CUDA kernel (a 10x10 matrix multiply in the Fortran OpenACC framework) just to test the kernel initialization/copy overheads. The very first time the kernel is called, it takes 31 milliseconds; the second call takes 0 milliseconds (it must be some microseconds). I would like to know what this 31 milliseconds is associated with. Is it CUDA setup time? Kernel copy time? Or something else?

If it is CUDA setup time, then any following kernel calls should not carry this 31-millisecond cost. Indeed, in my experiments running a few kernels, any subsequent small-scale kernel call after the very first one comes back as 0 milliseconds (i.e., microseconds).

I have also read in some of this forum's threads that a cudaMalloc can take 30 milliseconds. Could the 30 milliseconds I observe at kernel run time be associated with some sort of cudaMalloc?

I need to understand the overhead of calling an empty/small-scale kernel. Also, is this 30 milliseconds a one-time overhead, or might it recur for different kernels in consecutive runs?

Thanks in advance!
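One way to check whether the 31 ms is one-time context creation rather than anything kernel-specific is to force initialization up front with a cheap runtime call and then time a launch with CUDA events. A minimal sketch in CUDA C (separate from the OpenACC code, just to isolate the effect):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void empty() {}

int main()
{
    cudaFree(0);                       // force one-time context creation here

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // time a single launch after the context already exists
    cudaEventRecord(start, 0);
    empty<<<1, 1>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("launch after init: %f ms\n", ms);   // microseconds expected, not ~30 ms

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}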