Is there a performance guide on unbalanced kernels?

In general we want CUDA kernels to have balanced execution times. But for some applications we may want to allow some blocks to run longer than others, for example to process different lengths of memory with the same shared memory in reduction-style operations.

How does the imbalance hurt execution? For many applications we do know in advance the approximate running time of each block. Is it possible to use that estimate to hint the scheduler toward better scheduling? E.g., let the longer-running blocks run first so that the GPU is more easily kept fully occupied.

On compute capability 2.x devices, having blocks with varying execution times is fine. On 1.x devices new blocks are started synchronously, so you should use persistent threads there.
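A minimal persistent-threads sketch: launch only as many blocks as the GPU can keep resident, and have each block pull work items from a global counter until the queue is drained, so slow items don't hold up block scheduling. All names here (`nextItem`, `work`, `numItems`) are illustrative, not from this thread.

```cuda
__device__ unsigned int nextItem = 0;  // global work-queue cursor

__global__ void persistentKernel(const float *work, int numItems, float *out)
{
    __shared__ unsigned int item;
    while (true) {
        // One thread per block claims the next work item.
        if (threadIdx.x == 0)
            item = atomicAdd(&nextItem, 1u);
        __syncthreads();
        if (item >= (unsigned int)numItems)
            return;                       // queue drained, block exits
        // ... process work[item] cooperatively, write result to out[item] ...
        __syncthreads();                  // before 'item' is reused next loop
    }
}
```

Launch with roughly (number of SMs × blocks resident per SM) blocks rather than one block per work item; imbalance between items is then absorbed by the queue instead of by the block scheduler.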

If you have threads whose execution times vary wildly (e.g., by orders of magnitude), it is a good idea to start the long-running ones first.
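One way to do this, sketched below under the assumption that the block scheduler tends to issue blocks roughly in `blockIdx` order (a tendency, not something the CUDA programming model guarantees): sort the work items by estimated cost, longest first, on the host before launching. The `Item` struct and kernel name are hypothetical.

```cuda
#include <algorithm>
#include <vector>

struct Item { int id; float estCost; };  // estCost: predicted runtime

void launchLongestFirst(std::vector<Item>& items)
{
    // Sort descending by estimated cost so early blocks get the long items.
    std::sort(items.begin(), items.end(),
              [](const Item& a, const Item& b) { return a.estCost > b.estCost; });

    // Copy the reordered items to the device, then launch so that
    // block i handles items[i], e.g.:
    // myKernel<<<items.size(), 256>>>(d_items);
}
```

This way the short items fill in around the tail of the long ones instead of the long items starting last and leaving most of the GPU idle at the end.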