I know there is a hardware limit on the number of registers each thread can use, and that register usage can also limit the number of resident blocks on one SM. The compiler flag -maxrregcount seems to apply globally to every kernel in a CUDA source file. So if I have several kernels, where some want more threads with fewer registers each and others want more registers with fewer threads, I cannot find a single limit that suits both cases.
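To make the situation concrete, here is a minimal sketch of what I mean. The file name, kernel names, kernel bodies, launch sizes, and the cap of 32 are all made up for illustration; they are not from my real code.

```cuda
// kernels.cu (hypothetical file; all names and sizes are illustrative only)
#include <cuda_runtime.h>

// "Light" kernel: I would like many threads per block, few registers each.
__global__ void lightKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}

// "Heavy" kernel: I would like more registers per thread, fewer threads.
__global__ void heavyKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        #pragma unroll
        for (int k = 0; k < 16; ++k)
            acc += in[i] * (float)k;  // stand-in for register-hungry work
        out[i] = acc;
    }
}

int main()
{
    const int n = 1 << 20;
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    // Light kernel launched with large blocks, heavy kernel with small ones.
    lightKernel<<<(n + 1023) / 1024, 1024>>>(in, out, n);
    heavyKernel<<<(n + 127) / 128, 128>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}

// Both kernels end up under the same register cap when compiled like this:
//   nvcc -maxrregcount=32 -o kernels kernels.cu
```

With a single value on the command line, whatever cap suits lightKernel is also imposed on heavyKernel, and vice versa.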
Is it possible to set the -maxrregcount flag per kernel? Or should these kernels be split into separate source files and compiled separately with different flags?
Thanks for your replies.