Overhead of launching a new thread block

As we know, when a thread block terminates, global block scheduler picks and schedules a new thread block to an SM.

My question is: how to measure the time of launching a new thread block.

My initial idea is timing execution of an empty kernel global void empty{ }, which can measure the overhead of launching a kernel using only one thread. As for overhead of launching a thread block, I am not sure how to measure it.

Any pointer is welcome. Thank you.

Hello community,

I guess my previous description is not clear enough, so I edit my question to make it more clear.

I can imagine a method to estimate (upper bound) the minimum latency for threadblock scheduling. I’m not sure if that is what you want.

In a nutshell, launch a kernel with lots of threadblocks.

Each threadblock can consist of a single thread.

Each thread should:

  1. read the per SM clock64 counter
  2. store that in an array (entry time-stamp) associated with that SM. google get_smid to find out how to determine which SM you are on
  3. read the per SM clock64 counter again
  4. store that in another array (exit time-stamp) associated with that SM.
  5. exit

You may want to use atomics or some other method to get the “next entry” pointer for storage into the array.

Then for each SM post-process the arrays, looking for the minimum positive timestamp differece between any timestamp in the entry array and any timestamp in the exit array.

For that SM, this minimum timestamp difference becomes your upper-bound estimate of the minimum scheduling time, as measured in SM clocks.

If you wanted to, you could also reduce your estimate somewhat by counting the kernel SASS instructions before the first time-stamp storage, and counting the kernel SASS instructions after the last time-stamp storage, and reduce your estimate by a corresponding number of clocks.

Others may have better ideas.

My simplest take on this would be to take a trivial or empty kernel and vary the (large) number of blocks you run by a few millions. I would expect the incremental increase in runtime per block divided by the number of blocks running in parallel to give a similar result to the procedure that txbob suggested, when performed on the same kernel.

Once you verified that, this would allow you to remove the overhead of writing out something per block (unless some part in CUDA is clever enough to optimize away the launch of an empty kernel completely).

Thank you for the comments. txbob and tera.

Below is the kernels that I wrote to measure the overhead of launching a new thread block based on txbob’s comment.

__device__ __inline__ unsigned int __unique_smid()
{
   unsigned int smid;
   asm volatile("mov.u32 %0, %%smid;"	: "=r"(smid));
   return smid;
}

__global__ void empty(clock_t *timer_entry, clock_t *timer_exit, int* dsimd) {
   int bid = blockIdx.y*gridDim.x + blockIdx.x;
   timer_entry[bid] = clock();
   int smid = __unique_smid();
   dsimd[bid] = smid;
   timer_exit[bid] = clock();
}

My system configuration:
OS: Ubuntu 14.04
GPU: Tesla 2075
CUDA 7.0

Result: The minimum latency I got is 2 cycles, which also includes the execution time of instructions before first timestamp and instructions after the last time-stamp. As a result, it seems that the overhead is likely to be zero or negative number. (The number of thread blocks I launched is ranging from hundreds to millions. Obviously, each block only has one thread.)

My question is: why the overhead could be as small as 2 cycles.

Hope my description is clear for you to understand.

@tera: I do not quite understand this sentence, could you please elaborate a little more?

up

Why not provide your complete code?
I also suggested using clock64, not clock.

I believe there is potentially a flaw in my method.

Suppose the actual minimum latency is 4 clocks.

Now suppose a particular SM has 2 blocks that finish close to each other. One block finishes at clock cycle 10, the other finishes at clock cycle 12.

As a result of the block finishing at cycle 10, the SM schedules a new block which starts at cycle 14 (10 + latency). As a result of the block finishing at cycle 12, the SM schedules a new block which starts at cycle 16 (12 + latency).

The minimum measurement would be 14-12 = 2 cycles, even though this is not the actual latency. Instead the measurement you have may be nothing more than scheduling granularity.

A possible method to sort this out would be to guarantee that only 1 block can be scheduled per SM at a time. The simplest way I know of to do this for a GPU like your C2075 would be to use a threadblock/kernel definition that uses perhaps 40KB of shared memory. This will force the peak theoretical occupancy to drop to 1 block per SM.

Note that this particular approach may not work on GPUs like K80, where there is more shared memory available than 48K, for additional blocks. There are a number of GPUs available now that offer more than 48KB of shared memory in aggregate per SM. The 40KB number above should allow the method to work even on GPUs that offer 64KB of shared memory per SM, but will not guarantee 1 block per SM on cc3.7, cc5.2, and cc6.1 GPUs, which offer 96KB of shared memory per SM or more.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications__technical-specifications-per-compute-capability

My suggestion is to plot the runtime of a (nearly) empty kernel over the number of block launched and to do a linear fit.
The gradient of that plot should equal the overhead of launching a single block, divided by the number of blocks executing in parallel.

Sorry for the late update on this thread.

Currently, the overhead of launching a new thread block is 420 cycles (it includes the execution time of instructions before first timestamp and instructions after the last time-stamp), which is more reasonable than 2 cycles. If you want true overhead, subtract it by the execution time of these instructions before and after time-stamp.

The C2075 has 48KB shared memory. To control only one thread block execute on a SM, allocate 48KB shared memory.

__shared__ char shared[48*1024];

Meanwhile, to avoid compiler optimize away the shared memory allocation, use

shared[0] = 0;

As a result, the kernel becomes the following:

__shared__ char shared[48*1024];
__global__ void empty(clock_t *timer_entry, clock_t *timer_exit, int* dsimd) {
    shared[0] = 0;
   int bid = blockIdx.y*gridDim.x + blockIdx.x;
   timer_entry[bid] = clock();

   int smid = __unique_smid();
   dsimd[bid] = smid;

   timer_exit[bid] = clock();
}

To make sure that the kernel indeed allocate 48KB shared memory, pass --ptxas-options=-v to nvcc to print shared memory usage.

ptxas info    : 304 bytes gmem, 40 bytes cmem[14]
ptxas info    : Compiling entry function '_Z5emptyPlS_Pi' for 'sm_20'
ptxas info    : Function properties for _Z5emptyPlS_Pi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 12 registers, 49152 bytes smem, 56 bytes cmem[0]

The Kernel indeed use all the 48KB shared memory (49152 bytes smem).