How to set the maximum register count for each kernel?

I know there is a hardware limit on the number of registers per thread, and that register usage can also limit the number of resident blocks on one SM. The compiler flag -maxrregcount seems to apply globally to every kernel in a CUDA source file. So if I have several kernels, some of which want more threads with fewer registers while others want more registers with fewer threads, I cannot find a single limit that works for both cases.

Is it possible to set the -maxrregcount limit for each kernel individually? Or should these kernels be split into separate source files and compiled separately with different flags?

Thanks for your replies~

Use launch bounds: the __launch_bounds__ qualifier is set per kernel, not per source file.
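A minimal sketch of what per-kernel launch bounds might look like. The kernel names, bodies, and bound values here are hypothetical and chosen only for illustration. The qualifier takes the maximum threads per block and, optionally, a minimum number of resident blocks per SM; the compiler uses these as hints when deciding how many registers each thread may use, so the resulting register counts depend on the target architecture and compiler version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel meant to run with many threads per block.
// A large maxThreadsPerBlock pushes the compiler to keep
// per-thread register usage low enough for such a block to fit.
__global__ void __launch_bounds__(1024)
manyThreadsKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical kernel meant to run with few threads per block.
// A small maxThreadsPerBlock (and only 1 required resident block
// per SM) leaves the compiler free to use more registers per thread.
__global__ void __launch_bounds__(128, 1)
fewThreadsKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i] + 1.0f;
}

int main() {
    // Query how many registers the compiler actually assigned per thread.
    cudaFuncAttributes attr;

    if (cudaFuncGetAttributes(&attr, manyThreadsKernel) == cudaSuccess)
        printf("manyThreadsKernel: %d registers per thread\n", attr.numRegs);

    if (cudaFuncGetAttributes(&attr, fewThreadsKernel) == cudaSuccess)
        printf("fewThreadsKernel:  %d registers per thread\n", attr.numRegs);

    return 0;
}
```

Note that launching a kernel with more threads per block than its declared maxThreadsPerBlock fails, so the bound should match the largest block size actually used. Compiling with nvcc and the -Xptxas -v option also reports per-kernel register usage at build time.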

Thanks very much! That’s just what I’m looking for ~