Implications of the default setting (0) for Max Used Register (-maxrregcount)

The CUDA C/C++ device properties page in Visual Studio has Max Used Register set to 0.

What exactly does that mean in this context, and in general is this something I should be thinking about?

I have noticed small differences in running time when I changed that value, but nothing significant.

Visual Studio 2010 x64 with a K20c.

Is there a method for determining the optimal value for this property?

-maxrregcount is mostly an obsolete compatibility setting. Nowadays you would use __launch_bounds__() directives on individual kernels right in the source code.
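For illustration, a minimal sketch of what a per-kernel bound might look like (the kernel and the bound values here are made up, not a recommendation):

```
// Sketch: limiting register usage per kernel with __launch_bounds__.
// 256 = maximum threads per block this kernel will be launched with;
// 4 = minimum number of resident blocks per SM the compiler should
// target, which indirectly caps the register budget.
__global__ void __launch_bounds__(256, 4)
scaleKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}
```

Unlike -maxrregcount, which applies to every kernel in the compilation unit, __launch_bounds__() lets the compiler pick a different register budget for each kernel.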

Limiting the number of registers used can be useful to increase occupancy, which may or may not make your kernel run faster. You can use the Occupancy Calculator spreadsheet to see whether reducing the register count could improve occupancy. If a small reduction in register count improves occupancy without causing a significant number of register spills (you get that information from nvcc by compiling with the --ptxas-options=-v flag), it's then just a matter of experimenting and checking whether performance improves.
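For reference, a typical invocation might look like the following (the file name is a placeholder; sm_35 matches the K20c, and the exact report format varies by CUDA version):

```
# Ask ptxas to report per-kernel resource usage (registers, spill
# loads/stores, shared memory) at compile time.
nvcc -arch=sm_35 --ptxas-options=-v -c kernel.cu
```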

In addition to what tera said: My usual recommendation is to rely on the compiler defaults unless there is a very good reason not to. In my experience, in most situations the heuristics used by the compiler produce close-to-optimal bounds on register usage (this was not always the case in the early days of CUDA). Tweaking register bounds to optimize performance is what I would classify as a “heroic” optimization.

That said, the constraints of a particular project may require heroic optimizations, so if you decide to tweak performance by manual manipulation of the register limits, either via the __launch_bounds__() attribute or the -maxrregcount compiler flag, please be aware that the compiler evolves continuously and the generated code for non-trivial kernels tends to change from CUDA version to CUDA version. You may therefore have to occasionally check (and re-tweak) your settings if you want to maintain optimal performance.

Thanks to both of you for the information. At this point it seems best to focus on other optimizations.

In general I have been very impressed with the performance of the K20 with ints, floats, and doubles.

I don’t know what your application looks like, but as a general observation, Kepler provides a massive increase in FLOPS compared to Fermi, while memory bandwidth grew more modestly. As a consequence, it becomes increasingly important to use memory efficiently at all levels of the memory hierarchy. For applications that are compute-bound, I usually focus on simply minimizing dynamic instruction count (plus minimizing synchronization). For floating-point computations in particular, many transformations are not value-preserving and thus cannot be applied automatically by the compiler.
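To illustrate that last point with a minimal sketch (the helper functions below are hypothetical): floating-point addition is not associative, so the compiler must preserve the evaluation order as written, and reassociating a computation by hand can shorten dependency chains at the cost of slightly different results.

```
// Sketch: the two functions below are mathematically equivalent but can
// produce slightly different floating-point results, so the compiler
// will not transform one into the other on its own.
__device__ float sum4_sequential(float4 v)
{
    return ((v.x + v.y) + v.z) + v.w;   // one serial dependency chain
}

__device__ float sum4_pairwise(float4 v)
{
    return (v.x + v.y) + (v.z + v.w);   // two independent adds, shorter chain
}
```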