Limiting lmem usage in __device__ functions

I have a __global__ function which calls a __device__ function multiple times per thread. Within the device function, I declare several working variables. When I compile with --ptxas-options=-v, my lmem count is extremely high. (This is not due to the maxrregcount limit; my register usage is below maxrregcount.)

My hunch is that I could solve this by not declaring my working variables in the __device__ function, but instead declaring them as shared arrays (one element per thread) in the __global__ function and passing appropriate references to the __device__ function. While I know how to declare shared arrays, I am not really sure how to use them inside a __device__ function. (I am a longtime gentleman programmer, but not of the C variety. I am fairly certain this issue revolves around pointers, but I thought I might have read that pointers and shared memory don't mix.)

Is my hunch correct?

Are there any examples or documentation showing the most efficient way to structure device functions to limit lmem usage? (My lmem explosion is so great that even Fermi will have difficulty.)


Edit: Moderator, please feel free to move to CUDA Programming and Development. Apologies.

If you want to keep variables in shared memory, do exactly that: pass a reference! :)

__device__ void function(int &param) {
    // work on param, which lives in shared memory
}

__global__ void kernel() {
    __shared__ int variables[BLOCK_SIZE];
    function(variables[threadIdx.x]);
}



You can also declare references inside the code:


__shared__ int variables[BLOCK_SIZE];

int &var=variables[threadIdx.x];


__shared__ int warpVariables[BLOCK_SIZE/WARP_SIZE];

int &wvar=warpVariables[threadIdx.x/WARP_SIZE];

Sometimes I use this trick when I want all threads in a warp to use a single value, but threads from different warps to use different values, and I don't want to write [threadIdx.x/WARP_SIZE] each time.

I don't think so. You can declare 10 register variables in a __device__ function and call that function 100 times; it won't use 1000 registers, it will simply reuse the old ones. What is more, different functions may reuse registers as well. In PTX there are no device functions (unless you force the compiler to generate them) and everything is inlined, so you will see a tremendous number of registers there. But when the GPU machine code is generated, the compiler really tries to use as few registers as needed.
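To make the inlining point concrete, here is a minimal sketch (all names hypothetical): a __device__ function with its own register variables, called many times from a loop. After inlining, each call reuses the same registers, so 100 calls do not need 100 times the registers.

```cuda
__device__ float work(float x) {
    float a = x * 2.0f;   // a and b live in registers
    float b = a + 1.0f;
    return a * b;
}

__global__ void kernel(float *out) {
    float acc = 0.0f;
    // After inlining, every iteration reuses the same registers
    // that held a and b in the previous iteration.
    for (int i = 0; i < 100; ++i)
        acc += work(out[threadIdx.x] + (float)i);
    out[threadIdx.x] = acc;
}
```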

If you see local memory usage despite free registers, there may be other reasons for it. Most likely:

  • you use an uninitialised variable. More precisely: there is at least one execution path from the variable's declaration to a read without it ever being set, even if you know that path will never be taken. Try initialising all your registers at declaration.

  • you use an array with dynamically addressed cells. Registers cannot be indexed at run time; which register is used must be known at compile time.
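The second point can be illustrated with a short sketch (names hypothetical). When every array index is a compile-time constant (e.g. after full loop unrolling), the compiler can map the array onto registers; a runtime index forces the array into local memory:

```cuda
__global__ void kernel(float *out) {
    float buf[8];

    // Constant indices after #pragma unroll: buf can stay in registers.
    #pragma unroll
    for (int i = 0; i < 8; ++i)
        buf[i] = (float)i;

    // Runtime index: the compiler cannot pick a register at compile
    // time, so buf is likely spilled to local memory (lmem).
    float v = buf[threadIdx.x & 7];

    out[threadIdx.x] = v;
}
```

Checking the -v output of ptxas with and without the runtime-indexed read should show the lmem difference.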