I’ve been observing for a while now that the CUDA compiler happens to choose “stupid”(?) values for the number of registers a kernel should use, since on every Compute Capability only a few values make sense. For example, on Compute Capability 3.5 those values are: 32, 40, 48, 56, 64, 72, 80, 96, 128, 168, 255 (8, 16 and 24 are probably only useful if you want to run multiple kernels simultaneously). All other values result in unused registers and reduce the SM’s ability to hide latencies.
But the compiler doesn’t prefer those values. If the chosen number of registers is just a little larger than the closest useful value, the kernels often gain performance by limiting the register usage to that value. But in cases where the chosen value is far above the next-smaller useful value, limiting decreases the performance. Then the performance would probably benefit if the kernel used the next-larger useful value as its register usage.
So why does the compiler choose those bad values?
Can I somehow tell the compiler to just use those useful values?
This also leads to my next question: Is there any way to forcefully increase the register usage of a kernel? I’ve got some rather complex kernels with a large working set of automatic variables. Those kernels should benefit from higher ILP, but I cannot force the CUDA compiler to increase it by using more registers; I can only decrease it by limiting the registers (to a useful value if the compiler fails), which is kind of counterproductive.
How did you determine that the register counts chosen by the compiler are “bad”? The compiler uses heuristics to control the register use, tuned differently for each platform. For example, it often uses more registers when code is compiled for an sm_35 target than for an sm_20 target, since sm_35 has a copious number of registers available, while registers are in tight supply on sm_20.
Programmers can override the heuristic by use of the -maxrregcount compiler switch, which limits each kernel in the compilation unit to the designated register count. For finer-grained control of the maximum number of registers used, one can use the __launch_bounds__() attribute on a per-kernel basis.
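For illustration, here is a minimal sketch of the two mechanisms just mentioned (the kernel name, launch bounds and register limit are made up, not taken from this thread). The compile switch limits every kernel in the compilation unit, while the attribute constrains only the kernel it is attached to:

    // compile with, e.g.:  nvcc -arch=sm_35 -maxrregcount=32 example.cu
    __global__ void
    __launch_bounds__(256, 4)   // at most 256 threads per block, aim for >= 4 resident blocks per SM
    scale(float *out, const float *in, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = s * in[i];  // the compiler restricts register use so the bounds can be met
    }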
I find in my own work that the compiler heuristics work quite well at this time, and that I rarely have to override the compiler choice. A recent case where I did use __launch_bounds__() was for code that, without intervention, required 34 registers. By squeezing that down to 32 registers I managed to get enough of an increase in occupancy to reduce overall runtime (the instruction count increased, as the compiler was forced to re-compute temporaries that otherwise could have been stored in registers).
Specifying higher bounds for register use than what the compiler picks by itself does not seem to make sense. If all of the data already fits into registers, there is no point in using additional registers. Note that the assignment of variables to registers is dynamic, not static. This means that depending on the lifetime of variables, N variables may fit into M registers, where M < N. Over the duration of a kernel, one register may hold several different variables. As an example, the other day I worked on some code that used 190 32-bit variables at source level, but the number of registers required by the kernel was just 150.
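A small, hypothetical illustration of that dynamic assignment (not taken from the 190-variable code): two source-level variables whose lifetimes do not overlap can end up sharing one register.

    __global__ void reuse(float *out, const float *in)
    {
        float a = in[threadIdx.x] * 2.0f;        // 'a' is live only until the next store
        out[threadIdx.x] = a;
        float b = in[threadIdx.x] + 1.0f;        // 'a' is dead here, so 'b' can occupy the same register
        out[threadIdx.x + blockDim.x] = b;
    }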
If a kernel is pretty much latency bound, using all of the SM’s registers (either via TLP through higher occupancy or via ILP; which of the two is better depends on the kernel) will increase the performance. So the compiler should try to use all of those registers for the best performance; if it doesn’t, the performance won’t be optimal. For example, let’s say there’s a latency-bound kernel on SM_35 and the compiler chooses to use 41 registers. Since the register allocation granularity is 256 registers per warp, i.e. 8 per thread, every thread will actually be allocated 48 registers. Thus roughly 15 percent of the SM’s registers will be unused, which pretty much interferes with the SM’s ability to hide latencies. Because of that I’d say that the choice is “bad”.
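The rounding I am assuming here (8 registers per thread, 256 per warp, on sm_35) works out as follows; this is just host-side arithmetic to show where the numbers come from:

    #include <stdio.h>
    int main(void)
    {
        const int used        = 41;  // registers the compiler chose
        const int granularity = 8;   // assumed per-thread allocation granularity on sm_35
        const int allocated   = (used + granularity - 1) / granularity * granularity;  // = 48
        printf("allocated = %d, unused = %.1f %%\n",
               allocated, 100.0 * (allocated - used) / allocated);                     // ~14.6 %
        return 0;
    }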
Sorry, but I’ve observed the contrary. In almost every second or third kernel I program, the compiler chooses such a “bad” value, so that limiting to one of those good values increases the performance.
The problem is that not only do all of the automatic variables have to fit into registers; the compiler also needs to allocate enough registers for the intermediary results of the computations, and it needs to allocate registers for high-latency ops (global memory accesses) and issue them as early as possible, so that the kernel has high enough ILP and isn’t latency bound.
For example, let’s say there’s the following code:
A = B+C+D+E
The compiler would create something like this:
RegA = RegB+RegC
RegA = RegA+RegD
RegA = RegA+RegE
Then there’d be low ILP. After every op the warp would stall until the result of the previous operation becomes available again, so there are two stalls before the final result is ready.
But if the compiler created something like this:
RegTemp1 = RegB+RegC
RegTemp2 = RegD+RegE
RegA = RegTemp1+RegTemp2
Then there’d be just one stall, before the last op, and therefore much higher ILP, but the register usage increases by one.
Hence, if there are complex kernels which have a low occupancy, then increasing the ILP is crucial for the performance (at the cost of a few more registers).
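At CUDA C level the same idea can be nudged along by grouping independent operations explicitly; whether the compiler actually keeps this grouping has to be checked in the SASS, so treat this as a sketch (kernel and names invented) rather than a guaranteed outcome:

    __global__ void sum4(float *A, const float *B, const float *C,
                         const float *D, const float *E, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float t1 = B[i] + C[i];   // independent of t2, the two adds can issue back to back
            float t2 = D[i] + E[i];
            A[i] = t1 + t2;           // only this add has to wait for both temporaries
        }
    }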
I’m pretty sure the same applies to your 150-register kernel. The IPC, which can be measured in the Visual Profiler, is probably quite low (usually a sign of the kernel being limited by latencies or by bandwidth). Thus again the compiler could use up to 18 more registers to increase the ILP without reducing the occupancy, and thereby increase the performance.
If you have specific examples where the current (CUDA 5.5) compiler heuristics cause significantly lower performance than expected, I would encourage you to file bugs, with self-contained repro code attached. The bug reporting form is linked from the registered developer website.
Note that Kepler is the first GPU generation where ILP plays a role in performance, and generally I find that it is a second order effect compared to TLP, which is the main latency-hiding mechanism in GPUs. That does not mean one could not find counter examples.
Sadly it doesn’t work as intended. Although the register usage really does increase to RegCount, the compiler doesn’t use the additional registers for the variables in the rest of the kernel (confirmed by checking the assembly). Thus the performance doesn’t change at all.
If you go all the way back to the 4.2 version of the CUDA C Programming Guide you’ll find a list of “register allocation granularity” values for each architecture on pg. 63. This appears to have been removed and buried in the Occupancy Calculator in newer versions of the SDK.
One of the challenges I’ve found with constructing exotic register-precious kernels is that straightforward use of for-loops and locally declared arrays usually ends in tears in all but the simplest compiler-friendly cases. If there is any unexpected spillage of your registers to local memory then you know you are in trouble. The workaround that I’ve repeatedly exploited is to explicitly declare any large set of registers rather than use auto arrays… thus losing the ability to use iterative statements. [ This isn’t a fault of CUDA, rather it’s an issue with C not having any metaprogramming facility. C macros, code generation or using an advanced preprocessor are all workarounds. ]
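As a rough sketch of that workaround (names invented, and whether the array version actually spills depends on indexing and unrolling), the local array is replaced by individually named automatic variables:

    __global__ void accumulate(float *out, const float *in)
    {
        // instead of:  float v[4];  for (int k = 0; k < 4; ++k) v[k] = in[4 * threadIdx.x + k];
        float v0 = in[4 * threadIdx.x + 0];
        float v1 = in[4 * threadIdx.x + 1];
        float v2 = in[4 * threadIdx.x + 2];
        float v3 = in[4 * threadIdx.x + 3];
        out[threadIdx.x] = v0 + v1 + v2 + v3;   // each value stays eligible for a register
    }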
I’ve found that __launch_bounds__() really does work well in the last couple of releases. It’s worth experimenting with very explicit bounds to get a feel for how much register count “elasticity” is in your kernel. Sometimes there is no change in the register count at all, which might imply that NVCC’s default no-bounds heuristics are not in conflict with the bounds you selected.
Finally, if you’re super-optimizing then, as I’m sure you know, you need to inspect the SASS to see how the compiler is treating your code. Then you get to file lots and lots of compiler bugs and not receive free t-shirts from NVIDIA. :)