I’m trying to force register reuse in a kernel loop to conserve registers and increase occupancy.
Here is my pseudo code:
float result;
unsigned int found = 0;                 // flag bits accumulated over the loop ('|=' needs an initial value)
for (int i = 0; i < n; ++i)
{
    result = calculation1;              // calculation1/2, logic1/2, bitshift1/2 stand in for real expressions
    found |= ((unsigned int)result & logic1) << bitshift1;   // cast needed: '&' is not defined for float
    result = calculation2;
    found |= ((unsigned int)result & logic2) << bitshift2;
    // ... more calculation / flag pairs ...
}
I would expect this code to use only three registers: one for result, one for found, and one for the loop variable i. However, when I compile this and look at the PTX, all of the calculations are pulled outside of the loop. While that reduces the number of repeated calculations, it also bumps the register count past my desired limit, and the increased register usage decreases my occupancy.
I want to trade off computational efficiency for occupancy, so I am willing to have all of the calculations performed every time through the loop, as long as my internal loop registers are reused. Does anyone know how to enforce this in my kernel?
Final register assignment is not done at PTX generation time. In fact, nvcc emits PTX in static single assignment (SSA) form (to make later optimization easier), so counting registers in the PTX output doesn’t tell you anything about the final register usage.
If you want to see how many registers the kernel uses after PTX assembly, add the --ptxas-options=-v option to your nvcc command line.
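For example (the file name here is just a placeholder):

nvcc --ptxas-options=-v -c mykernel.cu

ptxas then prints a line per kernel along the lines of "ptxas info : Used 29 registers, ...", and that is the number that actually determines occupancy.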
Are you sure that occupancy is a problem at all? Unless it is really low, say below 192 threads per MP, higher occupancy does not generally translate into better performance.
Global-memory-bound algorithms do benefit from higher occupancy!
@jab_ca: There are a few things you can try. The most straightforward is to add the option
-maxrregcount=32
if you want to limit register usage to 32. This will probably cause some values to be recomputed and, where that is not possible, registers to be spilled into local memory. Do not panic about local memory: occasional spills in a high-occupancy kernel are not that harmful.
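For example (again, the file name is a placeholder), combined with the verbose ptxas output so you can see the effect:

nvcc -maxrregcount=32 --ptxas-options=-v -c mykernel.cu

Note that the flag applies to every kernel in the compilation unit, not just the one you are tuning.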
Another option is to put the ‘volatile’ keyword on the declaration of the register variables. You may try declaring result as volatile, found as volatile, or some other variables used inside calculation1 (a quick sketch follows at the end of this post). The volatile keyword tells the compiler that the value of the variable is “uncertain” and it has to take that into account: each time you access the variable, the value must be re-read and used in the subsequent computation, and no optimisation that skips that read may be performed.
On a side note, the above formulation, even if it is not pushed out of the loop, may still use more registers for the partial results of your computation.
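A minimal sketch of the volatile approach, reusing the placeholder names from the original pseudo code (calculation1, logic1, bitshift1 are still stand-ins):

volatile float result;                  // must be re-read on every use, so the compiler is
unsigned int found = 0;                 // discouraged from keeping extra copies of it alive
for (int i = 0; i < n; ++i)
{
    result = calculation1;              // loop body otherwise unchanged from the pseudo code
    found |= ((unsigned int)result & logic1) << bitshift1;
    // ...
}

Whether this actually lowers the register count depends on how many registers the intermediate values of calculation1 and calculation2 need, which is the point of the side note above.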
@seibert/@jjp - I am using blocks of 256 threads, so to utilize all 1024 available threads on my multiprocessor I need each thread to use at most 16 registers. Using --ptxas-options=-v I see each thread is using 29 registers, dropping my occupancy to 50%.
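(For reference, assuming this is a compute capability 1.2/1.3 part with a 16,384-register file per multiprocessor, which is what the 1024-thread limit suggests: 16384 / 1024 = 16 registers per thread for full occupancy, while at 29 registers per thread only two 256-thread blocks fit, i.e. 512 threads = 50%.)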
@Cygnus X1 - I tried maxrregcount and it works; however, the increased occupancy (100%) doesn’t make up for the extra runtime spent accessing local memory. I will have to give the volatile keyword a try.
Ok, for lots of kernels the benefit of going from 50% to 100% occupancy is not very large. I would not spend a large amount of time on this issue if there are other avenues for speeding up your code…
I’m curious as to why this would be true. Since I have 256 threads per block, 100% occupancy means 4 blocks per multiprocessor (256 x 4 = 1024 resident threads), while 50% occupancy means I’m only running 2 blocks of my grid per multiprocessor at a time. It seems like running twice as many blocks should translate to a noticeable speedup.
Because above a certain point, all occupancy can do is help hide memory latency. If you are already hiding memory latency effectively, then there is little to be gained. In a purely computational kernel, I believe that 96 (or maybe 192) threads per multiprocessor is all you need to keep the pipeline full. If there are global memory accesses, then having more threads helps keep the multiprocessor busy while those high-latency reads complete. However, at 50% occupancy the latency is usually hidden quite well (or you have bumped into a different limit, like memory bandwidth, which occupancy cannot help you with), and more occupancy has diminishing returns.
Many people are confused by the term and think that 50% occupancy means that the stream processors on the chip are idle 50% of the time, but that is not how NVIDIA defined “occupancy” in this case.
A streaming multiprocessor has 8 cores, and together these cores perform the work of a 32-thread warp. So at any given time only 32 threads are being executed; however, when global memory is accessed, the latency runs into hundreds of clock cycles. To hide that, another warp is scheduled. When you increase your occupancy you give the scheduler a bit more freedom in how to hide that latency, that is all.
In my code, when I moved from 50% to 100% occupancy, I got about 5-10% speedup. Noticeable, but not that much.
I’m not sure if it’s happening in your code, but sometimes the compiler will unroll loops automatically if the loop bounds are known at compile time.
As far as I can tell it also tends to group the address computations together, which ends up increasing your register usage by roughly the unrolling factor.
I’ve had to manually put #pragma unroll 1 statements before some of my loops to prevent this from occurring.
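Something like this (the kernel and its body are just a made-up illustration; only the pragma placement matters):

__global__ void mykernel(const float *in, unsigned int *out, int n)
{
    unsigned int found = 0;
    #pragma unroll 1                            // forbid automatic unrolling of this loop
    for (int i = 0; i < n; ++i)
    {
        float result = in[i] * in[i];           // stand-in for the real calculation
        found |= (result > 1.0f) << (i & 31);   // stand-in for the flag logic
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = found;
}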