Force a variable to be stored in a register

Short question: is there any way to force the CUDA compiler to keep a frequently used variable in a register?

Details of my situation:
When I compile with --ptxas-options=-v and --maxrregcount=32, I get the following line of output.

ptxas info: Used 32 registers, 40+0 bytes lmem, 10736+16 bytes smem, 48 bytes

This indicates that local memory is being used for something, although I don’t know exactly what. I suspect that one of my most frequently used variables is being placed in lmem, because writing to it is responsible for 37% of my application runtime. Is there any way to force the compiler to place a variable in register memory and allow other things to get dumped to lmem instead?


There is no such way as far as I know. You can inspect the output file with a tool to check which variables were placed in local memory. By the way, it looks like you are limited by shared memory size to only 1 block per SM on GT200; if your block size is not large, you can use more registers. You could also try using shared memory for temporary variable storage; in any case, about 5 KB of it is currently going unused.

You might place variables in shared memory to free registers without spilling to local memory.
Preferably in such a way that no two operands of any instruction come from shared memory, as that apparently would require moving one of them through a register again.
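A minimal sketch of what this could look like (the kernel body, BLOCK_SIZE, and variable names are made up for illustration): a per-thread accumulator moves from a register into a shared-memory slot indexed by threadIdx.x, and each instruction touching it has at most one shared-memory operand.

```cuda
#define BLOCK_SIZE 128  // assumed block size; must match the launch configuration

__global__ void sum_kernel(const float *in, float *out, int n)
{
    // was: float acc;  -- a register; now one shared-memory slot per thread
    __shared__ float acc[BLOCK_SIZE];

    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    acc[threadIdx.x] = 0.0f;
    for (int i = tid; i < n; i += gridDim.x * blockDim.x)
        acc[threadIdx.x] += in[i];   // only one shared operand per instruction

    if (tid < n)
        out[tid] = acc[threadIdx.x];
}
```

No __syncthreads() is needed here because each thread only ever touches its own slot.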

I agree, sounds like you need to free up some registers. Are you sure you need all of those?

Thank you for the replies.

Can you suggest a program that would be able to inspect the code and tell me which variables are being stored in local memory? This would help me gain confidence that my diagnosis is correct.

Yes, I think that 1 active block per SM can possibly be more of a bottleneck than spilling into lmem (unless this is a 1.4 device). So if you could decrease both lmem and smem usage, that would be awesome :)

In some cases you can reduce your register use by thinking like a register allocator, i.e. be aware of the lifetime and scope of your variables, and consider moving variables into a narrower scope (e.g. recompute an index variable instead of keeping it at top-level scope). This may be faster than letting the compiler spill some other, randomly chosen variables.
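A hypothetical before/after sketch of this scope-narrowing idea (kernel names and the index computation are invented for illustration): keeping idx live across the whole kernel pins a register, while recomputing it inside a narrow block frees that register for the work in between.

```cuda
__global__ void before(float *a, int pitch)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int idx = row * pitch + col;   // live from here until its last use below
    // ... many instructions during which idx stays pinned in a register ...
    a[idx] *= 2.0f;
}

__global__ void after(float *a, int pitch)
{
    // ... do the other work first, with fewer values live at once ...
    {
        // idx now lives only inside this block and is recomputed cheaply
        int idx = (blockIdx.y * blockDim.y + threadIdx.y) * pitch
                + (blockIdx.x * blockDim.x + threadIdx.x);
        a[idx] *= 2.0f;
    }
}
```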

I recently got down from 24 registers to 14 by doing some smart thinking (I had a hard limit of 16 registers because I need about 480 threads on compute 1.1 devices).
As a first approach I implemented a for loop to sum three contributions to the final solution, which unfortunately exceeded my register limit.

In my case it turned out that I didn’t really need temporary variables to accumulate the contributions in registers (shared memory wasn’t an option either, as I would have needed >16 KB). But summing the contributions in global memory was an option for me. And I didn’t need a loop - I was able to unroll the entire computation. I placed most variables in a very narrow scope and ended up with 14 registers used.

In your case, I suggest first investigating whether the “volatile trick” brings any gains: declare some index and loop variables volatile and see whether you save some lmem or registers in the process.
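For what this trick looks like in practice, here is a minimal, hypothetical sketch (variable names invented): marking a small index variable volatile forces the compiler to compute it once and keep it, rather than rematerializing it in ways that can lengthen other live ranges. The effect depends heavily on the compiler version, so measure before and after.

```cuda
__global__ void kernel(float *a)
{
    volatile int base = blockIdx.x * blockDim.x;  // was: int base = ...
    int tid = base + threadIdx.x;
    a[tid] += 1.0f;
}
```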


Thank you all for your replies. I was able to get everything into register memory. Variables are no longer spilling into local memory. Performance has improved, but not as much as I hoped it would.

I do not understand this comment. What do you mean by “one active block per SM”? Can someone please elaborate on why my shared memory usage is slowing down my program. I am pretty much just using shared memory as a cache, so it won’t be difficult for me to reduce the amount that I am using. But why would I want to reduce the size of this cache?

To increase the number of threads resident on one streaming multiprocessor.

I reduced my cache size and now the output is:

ptxas info: Used 32 registers, 8016+16 bytes smem, 48 bytes cmem[0], 40 bytes cmem[1]

Since my device has 16384 bytes of shared memory per multiprocessor, my understanding of your advice is that this should permit 2 active blocks per multiprocessor. However, the change resulted in no improvement in execution time. Decreasing the cache size should not have impacted performance significantly. Did I understand the advice correctly?

Yes, you should now have two blocks per SM. Block and grid sizes also matter.

My block size is 512. My grid dimensions are 16 x 16 (256 blocks). Do you see what I need to change?

This means you are still running only one block per SM (and, at this block size, have little chance of changing that unless you can get the register count down to 16), so there would be no improvement, and you might as well increase your shared memory use to close to 16 KB.

It might be beneficial to reduce the block size to 256 threads to run 2 blocks per SM. Whether or not this helps would very much depend on the memory access patterns, so you just have to try.

Other than that, I guess we cannot help you with optimization unless you give more insight into what you are doing or post real code.

Thank you tera. This was very good advice. I lowered my block size to 128 and got the register usage under 64 (59 actually). This resulted in an 11% improvement in performance. I would also like to thank all the other posters in this thread. Following the advice and suggestions provided resulted in a 27% improvement in performance. I am creating a new thread about the CUDA profiler tool. Please drop by if you get a chance. - Bill