In addition to what tera said: My usual recommendation is to rely on the compiler defaults unless there is a very good reason not to. In my experience, the compiler's heuristics produce close to optimal bounds on register usage in most situations (this was not always the case in the early days of CUDA). Tweaking register bounds to optimize performance is what I would classify as a “heroic” optimization.
That said, the constraints of a particular project may require heroic optimizations. If you decide to tune performance by manually manipulating the register limits, either per kernel via the __launch_bounds__() attribute or per compilation unit via the -maxrregcount compiler flag, please be aware that the compiler evolves continuously, and the generated code for non-trivial kernels tends to change from CUDA version to CUDA version. You may therefore have to re-check (and re-tweak) your settings occasionally if you want to maintain optimal performance.
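
For reference, here is a minimal sketch of what the per-kernel approach looks like. The kernel, the helper run_scale(), and the bound values (256 threads per block, at least 2 resident blocks per multiprocessor) are hypothetical and chosen only for illustration; suitable values depend entirely on your kernel and target GPU.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical bounds, for illustration only: tune per kernel and per GPU.
#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_SM     2

// __launch_bounds__ tells the compiler the largest block size this kernel
// will be launched with and, optionally, the minimum number of resident
// blocks per multiprocessor to aim for. The compiler derives a register
// limit from these two values.
__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_SM)
scale_kernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Hypothetical host helper: fills an array with 1.0f, scales it by 2.0f
// on the GPU, and returns true if every element came back as 2.0f.
bool run_scale(int n)
{
    std::vector<float> host(n, 1.0f);
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    cudaError_t err = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        // The block size must not exceed the bound declared on the kernel,
        // otherwise the launch fails.
        int threads = MAX_THREADS_PER_BLOCK;
        int blocks  = (n + threads - 1) / threads;
        scale_kernel<<<blocks, threads>>>(dev, 2.0f, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err == cudaSuccess) {
        err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    if (err != cudaSuccess) return false;

    for (float v : host) {
        if (v != 2.0f) return false;
    }
    return true;
}

int main()
{
    bool ok = run_scale(1000);
    printf("%s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```

Compiling with nvcc -Xptxas -v reports the registers used per kernel, which makes it easy to compare the default against a bounded build after each toolkit upgrade. Note that -maxrregcount=N applies to every kernel in the compilation unit, whereas __launch_bounds__ affects only the kernel it is attached to.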