CUDA Pro Tip: Minimize the Tail Effect

Originally published at:

When I work on the optimization of CUDA kernels, I sometimes see a discrepancy between Achieved and Theoretical Occupancies. The Theoretical Occupancy is the ratio between the number of threads which may run on each multiprocessor (SM) and the maximum number of executable threads per SM (2048 on the Kepler architecture). This value is estimated…

You say the GPU arranges blocks in a grid into waves, and allocates them to SMs on a per-wave basis, not a block-by-block basis.
Does this mean an idle SM with free resources will not be assigned a ready block until every SM on the device is able to accept a new block?

Waves are an easy abstraction, but work is launched on a block-by-block basis (so the answer to your question is no). Even if your grid of blocks makes up a couple of full waves, you may still see a strong tail effect if a few blocks run significantly longer than the others. It's a rather classical scheduling problem.
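
To make the scheduling point concrete, here is a minimal sketch in Python of a greedy block-by-block dispatcher. It is a toy model, not a description of the actual hardware scheduler: `NUM_SLOTS` (how many blocks can be resident on the device at once), the function names, and the block durations are all hypothetical values chosen for illustration. Each block is handed to whichever slot frees up first, without waiting for a whole wave to finish.

```python
import heapq

def makespan(block_durations, num_slots):
    """Time at which the last block finishes under greedy block-by-block dispatch.

    Toy model (an assumption, not the real GPU scheduler): blocks are issued
    in order, and each one starts on whichever slot becomes free first.
    """
    slots = [0.0] * num_slots          # time at which each slot becomes free
    heapq.heapify(slots)
    finish = 0.0
    for duration in block_durations:
        start = heapq.heappop(slots)   # earliest-free slot takes the next block
        end = start + duration
        heapq.heappush(slots, end)
        finish = max(finish, end)
    return finish

def utilization(block_durations, num_slots):
    """Fraction of slot-time spent doing work until the grid completes."""
    total_work = sum(block_durations)
    return total_work / (num_slots * makespan(block_durations, num_slots))

NUM_SLOTS = 8                          # hypothetical number of resident blocks

uniform = [1.0] * 16                   # two full waves of equal-length blocks
skewed = [1.0] * 15 + [4.0]            # same block count, one block 4x longer

for name, grid in (("uniform", uniform), ("skewed", skewed)):
    print(name, makespan(grid, NUM_SLOTS), round(utilization(grid, NUM_SLOTS), 3))
```

In this model both grids contain exactly two waves' worth of blocks, yet the single long block stretches the total time from 2.0 to 5.0 time units, and the slots are busy for less than half of the run. The first 15 blocks are never held back waiting for a wave boundary; the loss comes entirely from the one straggler.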