Overhead of launching a new thread block

LongY · November 15, 2016, 4:02am

As we know, when a thread block terminates, global block scheduler picks and schedules a new thread block to an SM.

My question is: how to measure the time of launching a new thread block.

My initial idea is timing execution of an empty kernel global void empty{ }, which can measure the overhead of launching a kernel using only one thread. As for overhead of launching a thread block, I am not sure how to measure it.

Any pointer is welcome. Thank you.

LongY · November 16, 2016, 11:02pm

Hello community,

I guess my previous description is not clear enough, so I edit my question to make it more clear.

Robert_Crovella · November 16, 2016, 11:25pm

I can imagine a method to estimate (upper bound) the minimum latency for threadblock scheduling. I’m not sure if that is what you want.

In a nutshell, launch a kernel with lots of threadblocks.

Each threadblock can consist of a single thread.

Each thread should:

read the per SM clock64 counter
store that in an array (entry time-stamp) associated with that SM. google get_smid to find out how to determine which SM you are on
read the per SM clock64 counter again
store that in another array (exit time-stamp) associated with that SM.
exit

You may want to use atomics or some other method to get the “next entry” pointer for storage into the array.

Then for each SM post-process the arrays, looking for the minimum positive timestamp differece between any timestamp in the entry array and any timestamp in the exit array.

For that SM, this minimum timestamp difference becomes your upper-bound estimate of the minimum scheduling time, as measured in SM clocks.

If you wanted to, you could also reduce your estimate somewhat by counting the kernel SASS instructions before the first time-stamp storage, and counting the kernel SASS instructions after the last time-stamp storage, and reduce your estimate by a corresponding number of clocks.

Others may have better ideas.

tera · November 17, 2016, 2:01am

My simplest take on this would be to take a trivial or empty kernel and vary the (large) number of blocks you run by a few millions. I would expect the incremental increase in runtime per block divided by the number of blocks running in parallel to give a similar result to the procedure that txbob suggested, when performed on the same kernel.

Once you verified that, this would allow you to remove the overhead of writing out something per block (unless some part in CUDA is clever enough to optimize away the launch of an empty kernel completely).

LongY · November 20, 2016, 8:14am

Thank you for the comments. txbob and tera.

Below is the kernels that I wrote to measure the overhead of launching a new thread block based on txbob’s comment.

__device__ __inline__ unsigned int __unique_smid()
{
   unsigned int smid;
   asm volatile("mov.u32 %0, %%smid;"	: "=r"(smid));
   return smid;
}

__global__ void empty(clock_t *timer_entry, clock_t *timer_exit, int* dsimd) {
   int bid = blockIdx.y*gridDim.x + blockIdx.x;
   timer_entry[bid] = clock();
   int smid = __unique_smid();
   dsimd[bid] = smid;
   timer_exit[bid] = clock();
}

My system configuration:
OS: Ubuntu 14.04
GPU: Tesla 2075
CUDA 7.0

Result: The minimum latency I got is 2 cycles, which also includes the execution time of instructions before first timestamp and instructions after the last time-stamp. As a result, it seems that the overhead is likely to be zero or negative number. (The number of thread blocks I launched is ranging from hundreds to millions. Obviously, each block only has one thread.)

My question is: why the overhead could be as small as 2 cycles.

Hope my description is clear for you to understand.

LongY · November 20, 2016, 8:21am

@tera: I do not quite understand this sentence, could you please elaborate a little more?

LongY · November 20, 2016, 8:27am

up

Robert_Crovella · November 20, 2016, 3:16pm

Why not provide your complete code?
I also suggested using clock64, not clock.

I believe there is potentially a flaw in my method.

Suppose the actual minimum latency is 4 clocks.

Now suppose a particular SM has 2 blocks that finish close to each other. One block finishes at clock cycle 10, the other finishes at clock cycle 12.

As a result of the block finishing at cycle 10, the SM schedules a new block which starts at cycle 14 (10 + latency). As a result of the block finishing at cycle 12, the SM schedules a new block which starts at cycle 16 (12 + latency).

The minimum measurement would be 14-12 = 2 cycles, even though this is not the actual latency. Instead the measurement you have may be nothing more than scheduling granularity.

A possible method to sort this out would be to guarantee that only 1 block can be scheduled per SM at a time. The simplest way I know of to do this for a GPU like your C2075 would be to use a threadblock/kernel definition that uses perhaps 40KB of shared memory. This will force the peak theoretical occupancy to drop to 1 block per SM.

Note that this particular approach may not work on GPUs like K80, where there is more shared memory available than 48K, for additional blocks. There are a number of GPUs available now that offer more than 48KB of shared memory in aggregate per SM. The 40KB number above should allow the method to work even on GPUs that offer 64KB of shared memory per SM, but will not guarantee 1 block per SM on cc3.7, cc5.2, and cc6.1 GPUs, which offer 96KB of shared memory per SM or more.

[url]Programming Guide :: CUDA Toolkit Documentation

tera · November 20, 2016, 5:30pm

My suggestion is to plot the runtime of a (nearly) empty kernel over the number of block launched and to do a linear fit.
The gradient of that plot should equal the overhead of launching a single block, divided by the number of blocks executing in parallel.

LongY · December 1, 2016, 12:35am

Sorry for the late update on this thread.

Currently, the overhead of launching a new thread block is 420 cycles (it includes the execution time of instructions before first timestamp and instructions after the last time-stamp), which is more reasonable than 2 cycles. If you want true overhead, subtract it by the execution time of these instructions before and after time-stamp.

The C2075 has 48KB shared memory. To control only one thread block execute on a SM, allocate 48KB shared memory.

__shared__ char shared[48*1024];

Meanwhile, to avoid compiler optimize away the shared memory allocation, use

shared[0] = 0;

As a result, the kernel becomes the following:

__shared__ char shared[48*1024];
__global__ void empty(clock_t *timer_entry, clock_t *timer_exit, int* dsimd) {
    shared[0] = 0;
   int bid = blockIdx.y*gridDim.x + blockIdx.x;
   timer_entry[bid] = clock();

   int smid = __unique_smid();
   dsimd[bid] = smid;

   timer_exit[bid] = clock();
}

To make sure that the kernel indeed allocate 48KB shared memory, pass --ptxas-options=-v to nvcc to print shared memory usage.

ptxas info    : 304 bytes gmem, 40 bytes cmem[14]
ptxas info    : Compiling entry function '_Z5emptyPlS_Pi' for 'sm_20'
ptxas info    : Function properties for _Z5emptyPlS_Pi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 12 registers, 49152 bytes smem, 56 bytes cmem[0]

The Kernel indeed use all the 48KB shared memory (49152 bytes smem).

Topic		Replies	Views
What's the cost of loading in blocks? CUDA Programming and Performance	3	2360	April 9, 2008
reduce overhead of launching a new thread block CUDA Programming and Performance	15	4809	February 15, 2018
fundamental cuda kernel launch questions CUDA Programming and Performance	2	16542	July 31, 2008
kernel launch overhead for GTX 280 CUDA Programming and Performance	17	3780	November 5, 2009
Overhead of launching thread blocks CUDA Programming and Performance	3	4592	September 2, 2008
Overhead for SM to switch to other set of blocks ? CUDA Programming and Performance	4	5594	August 28, 2008
kernel launch overhead timing best practices CUDA Programming and Performance	3	10049	June 24, 2014
Kernel enqueue overhead Bringing kernel overhead down? CUDA Programming and Performance	9	13843	March 12, 2010
Kernel execution overhead CUDA Programming and Performance	2	1196	July 6, 2009
kernel call overhead: timing results overhead is large for small # of calls CUDA Programming and Performance	16	7937	March 8, 2013

Overhead of launching a new thread block

Related topics