Increasing Register Usage Haven't seen a good discussion on this since CUDA 1.0...

Hi all,

So I have a kernel that is limited by shared memory usage and NOT registers. I would therefore like to take some of the 168 bytes of lmem being used and move that data into registers (of which I have 32 per thread to spare before registers become limiting, per the occupancy calculator).

Since I don't have any arrays in my program (besides those assigned to shared or global memory), what else can I do to make nvcc NOT put data into local memory, and instead put it into registers?
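For anyone else hitting this: besides math functions, the usual cause of lmem is a locally declared array indexed with a value the compiler can't resolve at compile time, since registers aren't addressable. A minimal sketch (names made up for illustration):

```cuda
// Sketch: why a dynamically indexed local array lands in lmem
// while plain scalars stay in registers.
__global__ void spillDemo(const float *in, float *out, int idx)
{
    float table[8];
    for (int i = 0; i < 8; ++i)
        table[i] = in[threadIdx.x * 8 + i];

    // Runtime index: registers are not addressable, so the compiler
    // must place 'table' in local memory to honor this access.
    float spilled = table[idx & 7];

    // Equivalent scalar work: each value can live in its own register.
    float a = in[threadIdx.x];
    float b = a * a;

    out[threadIdx.x] = spilled + b;
}
```

If every index into a fixed-size local array is a compile-time constant (e.g. after full unrolling), the compiler can usually promote the elements to registers instead.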


Strange, have you been able to determine which variables are put into lmem? Looking at the PTX (using the following NVCC options) might help:

--keep --opencc-options -LIST:source=on


What does --ptxas-options="-v -mem" say?
BTW, sin/cos/… also adds to lmem

Interesting, I do have a good number of sines and cosines in one function; that could be the culprit. I haven't had a chance to look at the PTX yet as I'm about to run out the door, but I did look at -v -mem:

[codebox]ptxas info : Compiling entry function '_Z8MCkerneltPK6xsinfoiP7neutron'
ptxas info : Used 31 registers, 168+0 bytes lmem, 2080+16 bytes smem, 48632 bytes cmem[0], 256 bytes cmem[1]

Memory space statistics for 'OCG memory pool for function _Z8MCkerneltPK6xsinfoiP7neutron'
Page size              : 0x1000 bytes
Total allocated        : 0x22e4f00 bytes
Total available        : 0x1bb240 bytes
Nrof small block pages : 3648
Nrof large block pages : 3182
Longest free list size : 1
Average free list size : 0

Memory space statistics for 'Top level ptxas memory pool'
Page size              : 0x1000 bytes
Total allocated        : 0x58f88 bytes
Total available        : 0x26318 bytes
Nrof small block pages : 78
Nrof large block pages : 3
Longest free list size : 1
Average free list size : 0

Memory space statistics for 'Permanent OCG memory pool'
Page size              : 0x1000 bytes
Total allocated        : 0xc0160 bytes
Total available        : 0x5d90 bytes
Nrof small block pages : 5
Nrof large block pages : 35
Longest free list size : 1
Average free list size : 0

Memory space statistics for 'PTX parsing state'
Page size              : 0x1000 bytes
Total allocated        : 0x23cef8 bytes
Total available        : 0x1da88 bytes
Nrof small block pages : 535
Nrof large block pages : 8
Longest free list size : 1
Average free list size : 0

Memory space statistics for 'Command option parser'
Page size              : 0x1000 bytes
Total allocated        : 0x9108 bytes
Total available        : 0x7038 bytes
Nrof small block pages : 9
Nrof large block pages : 0[/codebox]


Note: this was edited because I realized I didn't include -arch sm_13 on the command line. It's been updated.

How many threads do you open per block? If 256, then 31 registers is high and you're limited by registers.

In any case, what you can do is comment out code in your kernel to see which lines add to the lmem usage, and then think about how to decrease it (for example, calculate sin/cos values on the CPU)…

Make sure that while you're commenting out code, you don't let the compiler optimize your kernel away…
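One way to keep the compiler from optimizing the kernel away while you bisect it is to make sure something still reaches global memory. A sketch of the idea (kernel and variable names are made up):

```cuda
__global__ void trimmedKernel(const float *in, float *out)
{
    float acc = in[threadIdx.x];

    // Section under test: comment pieces in and out, recompile with
    // --ptxas-options=-v each time, and watch the lmem number change.
    // acc += sinf(acc);
    // acc += cosf(acc);

    // Writing the accumulator out makes it observable, so ptxas cannot
    // eliminate the surviving code as dead.
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
```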


If you do not mind losing a little precision in some fringe cases, consider the __sinf() and __cosf() intrinsics (assuming you're doing single-precision maths).

The reference manual explains the difference between sinf() and __sinf() and how it affects your particular use case.
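For single precision the swap is mechanical; a sketch for comparing the two on your own input range (names made up):

```cuda
__global__ void trigCompare(const float *ang, float *diff, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = ang[i];
    // Library calls: full argument reduction, higher accuracy,
    // and the lmem cost discussed in this thread.
    float slow = sinf(x) + cosf(x);
    // Hardware intrinsics: roughly one operation per clock, no lmem,
    // reduced accuracy that degrades as |x| grows.
    float fast = __sinf(x) + __cosf(x);

    diff[i] = slow - fast;   // inspect this over your actual inputs
}
```

Compiling with -use_fast_math performs the same substitution globally (sinf becomes __sinf, and so on), which is handy for a quick measurement but coarser than hand-picking call sites.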


I have 64 threads per block, and it looks like I can now get to 40 registers per thread (sorry, my shared mem usage has changed substantially since my last post, and that made registers the limiting factor).

I will have to go through your suggestion later today and see where the culprit is.

I am a little leery about using __sinf and __cosf because of the accuracy.

But how do I guarantee use of the fast path as described below? All of my trig operations are on numbers definitely smaller than 48039.0f, but that's not something the compiler would know, I don't think…

[codebox]Sine and Cosine

Throughput of __sinf(x), __cosf(x), __expf(x) (see Section C.2) is 1 operation per clock cycle.

sinf(x), cosf(x), tanf(x), sincosf(x) and corresponding double-precision instructions are much more expensive and even more so if the absolute value of x needs to be reduced.

More precisely, the argument reduction code (see math_functions.h for implementation) comprises two code paths referred to as the fast path and the slow path, respectively.

The fast path is used for arguments sufficiently small in magnitude and essentially consists of a few multiply-add operations. The slow path is used for arguments large in magnitude, and consists of lengthy computations required to achieve correct results over the entire argument range.

At present, the argument reduction code for the trigonometric functions selects the fast path for arguments whose magnitude is less than 48039.0f for the single-precision functions, and less than 2147483648.0 for the double-precision functions.

As the slow path requires more registers than the fast path, an attempt has been made to reduce register pressure in the slow path by storing some intermediate variables in local memory, which may affect performance because of local memory high latency and bandwidth (see Section …). At present, 28 bytes of local memory are used by single-precision functions, and 44 bytes are used by double-precision functions. However, the exact amount is subject to change.[/codebox]
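On the question of guaranteeing the fast path: the selection happens at run time from |x|, so arguments below the threshold already take it. The slow-path lmem is allocated statically either way, which is why it shows up in the ptxas output even if that path never executes. If your inputs are bounded, another option is to do the range reduction yourself and then call the intrinsic, which is at its most accurate for small arguments. A sketch, assuming the application guarantees moderate angles:

```cuda
__device__ float boundedSin(float x)
{
    // Manual range reduction to roughly [-pi, pi]. Cheap because the
    // caller guarantees |x| stays moderate, so one multiply/round step
    // replaces the library's full reduction machinery.
    const float invTwoPi = 0.15915494f;   // 1 / (2*pi)
    const float twoPi    = 6.2831853f;
    x -= twoPi * rintf(x * invTwoPi);

    // The intrinsic is fast (~1 op/clock) and its error bound is
    // tightest for small-magnitude arguments like these.
    return __sinf(x);
}
```

Whether the intrinsic's accuracy is acceptable still has to be judged against your application's tolerance, as discussed above.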

Well, I now know for sure that the lmem usage comes from those math functions: -use_fast_math removes lmem entirely.

It increased run time by ~7%.

I think now I'm going to focus on eking out those last blocks per MP (I am at 6, and can get to 8 with a slight reduction in shared mem!).

Make sure fast math does not give you wrong answers ;)

Increasing occupancy past 50% typically only provides very tiny performance gains.

Agreed… I did see a change to my values, but my code is a Monte Carlo code and so it is stochastic in nature; since the error from fast math should also be stochastic, I could make an argument that fast math is okay.

However… I won’t make that argument, I will merely include it in my write-up as 'something considered, which results in [TBD with the final version of the code]% increase in performance and x% deviation in the result." That would make it suitable for scoping calculations, where many more are run, but less accuracy is expected.

Agreed, but it looks like I could test it out with only minimal programming effort (I'm talking 10 minutes here).

So we’ll see, Right now I’m at 38%, and would love to get to 50%.