In general we want CUDA kernels to have balanced execution time across blocks. But for some applications we may want to allow some blocks to run longer than others, for example when each block processes a different-length region of memory but uses the same shared-memory, reduction-style operation.
How much will this imbalance hurt execution? For many applications we really do know in advance, approximately, how long each block will run. Is it possible to use these runtime estimates to hint the scheduler toward a better schedule? For example, letting the longer-running blocks start first so that the GPU stays fully occupied more easily.