Launch bounds restriction

Hello,

If i set the launch bounds as follow :

__launch_bounds__(maxThreadsPerBlock = 1024, minBlocksPerMultiprocessor)

And the actual number of register is lower than the upper limit, will there be a case where the compiler increases the number of resident block per SM in runtime?

Thanks

Abdoulaye

launch bounds is definitely a hint to the compiler and can influence register usage both up and down, depending on the actual use case.

What do you mean by increasing the number of resident blocks?

The compiler definitely tries to fulfil the boundary conditions with an as high an occupancy as makes sense. You can increase this effect by increasing the minimum blocks parameter.

Those calculations are theoretical and done at compile time.
At runtime the number of registers does not change.

The number of resident blocks can change at runtime, for example, if other kernels are already running on the same SMs.

The launch bounds use a minimum blocks per multiprocessor as parameter. The resident blocks per SM of course can be different/higher than the minimum number.

You are talking about the actual number of registers. The number of registers is a fixed value, which is known at CUDA compile time (which can be during the runtime of the program, e.g. if it is not an architecture, which was compiled yet, and you provide the PTX at runtime).

Or are you talking about a kernel, which needs some amount of registers always, and only conditionally uses more registers, e.g. within an inner if block? And want to use the less registers for more occupancy. Then this is neither known during kernel compilation nor during the runtime invocation. Those registers have to be present and reserved for the kernel.

Small exception for the number of registers being fixed:
The setmaxnreg PTX directive can change the number of registers during a kernel run (for Hopper and Blackwell with no guarantee to be upwards compatible), but not increase the overall number of registers per block.

I was referring to the programming guide and it is explaining the possibility for using higher register usage if GPU resources permit it. But, i was wondering about the number of resident block in order to maximize the occupancy as well. Indeed, the launch bounds arguments are just for setting the upper/lower limit block configuration and the actual one is only decided in runtime.

No, the launch bounds are considered, when compiling and a fixed number of registers is chosen to fulfil those bounds for the target architecture (the minimum blocks for the case, the SM is empty from other kernels). The chosen number also depends on the kernel itself, how easy it is to use less registers.

The result is a certain number of needed registers, similar to maxrregcount. And only this needed number of registers (together with the other resources like shared memory and number of threads per block) contributes to the occupancy.

The direct parameters of the launch bound are considered during compile-time (for determining the number of registers), not during runtime.

It is like a simulation run to find the suitable register numbers.

1 Like