Reduce local memory usage

Hi all,

As we know, the Jetson TX1 uses the Maxwell architecture, which supports a maximum of 255 registers per thread. However, when I compiled my kernel recently, the resource usage was as follows:

ptxas info : Compiling entry function '_Z8SparseMMPKiS0_S0_PKfS2_iiPf' for 'sm_53'
ptxas info : Used 40 registers, 4096 bytes smem, 352 bytes cmem[0], 4 bytes cmem[2], 132 bytes lmem

Accessing local memory hurts the performance of my kernel significantly, and I don’t know how to avoid it, either by changing my code or by using compiler options.

In theory, if the local memory usage were moved into registers, it would need 132/4 = 33 registers (each register holds 4 bytes). Added to the 40 registers already in use, the kernel would need only 73 registers, well below the limit of 255. I am not sure what is wrong with my compilation.

The compile command is as follows:
nvcc -O2 -arch=sm_53 -lineinfo -Xptxas -v,-abi=no,-dlcm=cs -c

Thank you for your help.

Hi GD_06,

By default, the compiler tries to use more registers to avoid spilling to local memory. However, spilling is NOT the only source of local memory usage.

If the program uses an array in a kernel or device function and indexes it with a value that is not known at compile time (for example, inside a loop that is not unrolled), the compiler is forced to put the array in local memory, because registers cannot be addressed by a run-time index.
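To illustrate the pattern, here is a minimal sketch of a kernel with a dynamically indexed per-thread array. It is not the original SparseMM kernel: the names `dynamicIndexKernel`, `runDynamicIndex`, and `SLOTS` are hypothetical. Building something like this with `-Xptxas -v` would be expected to report a non-zero lmem figure, although the exact number depends on the compiler version and flags.

```cuda
#include <cuda_runtime.h>

#define SLOTS 8   // hypothetical per-thread array size (power of two)

// The element of `acc` that gets updated depends on data loaded at run time,
// so the compiler generally cannot map `acc` onto registers.
__global__ void dynamicIndexKernel(const int *idx, const float *in,
                                   float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float acc[SLOTS] = {0.0f};                 // automatic (per-thread) array
    for (int i = 0; i < SLOTS; ++i) {
        int slot = idx[tid * SLOTS + i] & (SLOTS - 1);  // run-time index
        acc[slot] += in[tid * SLOTS + i];      // dynamic indexing happens here
    }

    float sum = 0.0f;
    for (int i = 0; i < SLOTS; ++i) sum += acc[i];
    out[tid] = sum;
}

// Host helper: runs a single thread on fixed data and returns its result.
// Error checking is omitted to keep the sketch short.
float runDynamicIndex()
{
    const int   hIdx[SLOTS] = {3, 1, 4, 1, 5, 9, 2, 6};
    const float hIn[SLOTS]  = {1, 2, 3, 4, 5, 6, 7, 8};
    float hOut = 0.0f;

    int *dIdx = nullptr;
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIdx, sizeof(hIdx));
    cudaMalloc(&dIn, sizeof(hIn));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIdx, hIdx, sizeof(hIdx), cudaMemcpyHostToDevice);
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);

    dynamicIndexKernel<<<1, 1>>>(dIdx, dIn, dOut, 1);

    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(&hOut, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIdx);
    cudaFree(dIn);
    cudaFree(dOut);
    return hOut;
}
```

The result is simply the sum of the eight inputs; what matters here is the `acc[slot]` access, where `slot` is only known at run time.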

In this case, I am not sure what the source of the local memory usage is. It could be the compiler spilling registers to local memory (which seems unlikely, since the register count is low and the compiler could have used more), or an indexed array in the program (which seems the most likely cause).
For the latter case, you would need to modify the application so that it does not use such an indexed array. That may not be easy or even possible, and it comes at a cost.
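One possible way to remove a dynamically indexed array, sketched under the assumption that the array is small and its size is known at compile time, is to replace the run-time index with a fully unrolled loop that compares against every slot. After unrolling, every access to the array uses a constant index, so the compiler is free to keep it in registers. The names `staticIndexKernel`, `runStaticIndex`, and `SLOTS` are hypothetical and not taken from the original kernel, and whether lmem actually drops to 0 should be confirmed with `-Xptxas -v`.

```cuda
#include <cuda_runtime.h>

#define SLOTS 8   // hypothetical per-thread array size (power of two)

// Same computation as a dynamically indexed accumulation, but `acc` is only
// ever accessed with indices that are constants once the loops are unrolled.
__global__ void staticIndexKernel(const int *idx, const float *in,
                                  float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float acc[SLOTS] = {0.0f};
    #pragma unroll
    for (int i = 0; i < SLOTS; ++i) {
        int   slot = idx[tid * SLOTS + i] & (SLOTS - 1);  // run-time value
        float v    = in[tid * SLOTS + i];
        // Visit every slot and add v only to the matching one. This is the
        // cost of the rewrite: SLOTS selects and adds instead of one add.
        #pragma unroll
        for (int s = 0; s < SLOTS; ++s)
            acc[s] += (s == slot) ? v : 0.0f;
    }

    float sum = 0.0f;
    #pragma unroll
    for (int s = 0; s < SLOTS; ++s) sum += acc[s];
    out[tid] = sum;
}

// Host helper: runs a single thread on fixed data and returns its result.
// Error checking is omitted to keep the sketch short.
float runStaticIndex()
{
    const int   hIdx[SLOTS] = {3, 1, 4, 1, 5, 9, 2, 6};
    const float hIn[SLOTS]  = {1, 2, 3, 4, 5, 6, 7, 8};
    float hOut = 0.0f;

    int *dIdx = nullptr;
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIdx, sizeof(hIdx));
    cudaMalloc(&dIn, sizeof(hIn));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIdx, hIdx, sizeof(hIdx), cudaMemcpyHostToDevice);
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);

    staticIndexKernel<<<1, 1>>>(dIdx, dIn, dOut, 1);

    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(&hOut, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIdx);
    cudaFree(dIn);
    cudaFree(dOut);
    return hOut;
}
```

The extra compare-and-select work grows with the array size, so this trade only tends to pay off for small arrays; for larger ones, shared memory may be a better home for the data.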

BTW, the CUDA Best Practices Guide also documents these details, which users should be aware of:

Local memory is used only to hold automatic variables. This is done by the nvcc compiler when it determines that there is insufficient register space to hold the variable. Automatic variables that are likely to be placed in local memory are large structures or arrays that would consume too much register space and arrays that the compiler determines may be indexed dynamically.


Hi kayccc,

Thank you for your answer; it really helps a lot.
In my case, the local memory usage should come from indexed array access inside a for loop.