How to prevent nvcc from using local memory?


I have a pretty heavy kernel with relatively high resource usage:

lmem = 704

smem = 36

reg = 124

After inserting some if()'s I’ve managed to reduce lmem and register usage (yes, this still works in 1.1…):

lmem = 512

smem = 36

reg = 83

Okay, here’s my question: is there any way to force nvcc not to use local memory? I’m not using arrays with variable index in my kernel. I’m using CUDA 1.1.

I have a kernel that uses 84 bytes lmem, 280 bytes smem, 31 registers. I also do not use any indexing into memory that would explain the use of local memory.

Is there any way to find out exactly which variables get placed in local memory? And to try to remove it, should I just put if()'s into my kernel?

With decuda you can see what gets placed in local memory. But I think the only way of using less local memory is to reduce register usage, as you are exceeding the total register capacity: lmem = 512 bytes is 128 32-bit words, plus reg = 83, that’s 211 “registers” in total.

Can you please explain this in more detail?

I have seen that local memory is far, far slower than global memory.
So I think you would be better off creating per-thread global memory arrays and indexing them with your effective thread id.
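A minimal sketch of that idea (kernel name, buffer, and the SCRATCH_PER_THREAD size are all made up for illustration, not from this thread):

```cuda
// Sketch: replace a per-thread local array ("float tmp[32]" that nvcc may
// spill to lmem) with a per-thread slice of a global scratch buffer,
// indexed by the effective thread id. Sizes/names are assumptions.
#define SCRATCH_PER_THREAD 32

__global__ void kernelWithGlobalScratch(float *scratch, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Each thread owns its own SCRATCH_PER_THREAD-element slice.
    float *myScratch = scratch + tid * SCRATCH_PER_THREAD;

    float sum = 0.0f;
    for (int i = 0; i < SCRATCH_PER_THREAD; ++i) {
        myScratch[i] = (float)(i * tid);
        sum += myScratch[i];
    }
    out[tid] = sum;
}
```

Note that for coalescing on 1.x hardware you would usually interleave the layout instead (`scratch[i * n + tid]`), so that neighbouring threads touch neighbouring addresses.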

DISCLAIMER: I disclaim my theory about local mem. I just came across this once. I am scared of using local memory, which can zap your performance. Just my view. That’s all.

But in my case I have 84 / 4 = 21 ‘registers’ in local memory and 31 real registers. That would mean 52 registers, and that would fit. By the way, I have some trouble correlating the output of decuda with my kernel code. Any hints to make it easier to correlate the two?

Adding syncthreads might be a possibility I guess, and then counting how many have passed before I see references to l[$ofs2+0x00000000].

I see btw. this in the ptx :

.local .align 4 .b8 __cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda_result16212212212212212212212212212[28];

	.local .align 4 .b8 __cuda___cuda___cuda___cuda___cuda___cuda___cuda_result48240240240240240240[28];

	.local .align 4 .b8 __cuda___cuda___cuda_result80268268[28];

And I must say that the number 28 is not really something I can relate to my kernel.

Some more info:

I previously had 2 kernels :
56 bytes lmem, 228 bytes smem, 20 registers
100 bytes lmem, 168 bytes smem, 29 registers

Merged them to prevent reading & writing to global memory and got:
84 bytes lmem, 296 bytes smem, 31 registers

That I already found quite strange, since I would have guessed I would get at least 100 bytes lmem.

Then I took the first kernel, started to optimise it (to try and minimize the local memory) and got:
56 bytes lmem, 188 bytes smem, 16 registers
So my optimisations lowered my register usage, but lmem usage was still the same.

When looking through the nvcc documentation, I encountered the --use_fast_math option. I tried it (just for fun) and got:
0 bytes lmem, 188 bytes smem, 10 registers

So apparently using the less accurate functions lowers my register count and local memory count?? When using the fast math option on my complete project I get wrong results, so it is not really an option. And I also do not understand the reason for this big change in lmem & register usage.
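One plausible explanation (a sketch, not verified against this particular kernel): the precise single-precision transcendentals on 1.x hardware are software routines that need many temporaries, while --use_fast_math substitutes the hardware intrinsics:

```cuda
// Hypothetical illustration: sinf() on sm_1x compiles to a software routine
// whose argument reduction needs many temporaries (registers, possibly
// lmem); __sinf() maps to a single hardware instruction.
__global__ void precise_version(float *data)
{
    data[threadIdx.x] = sinf(data[threadIdx.x]);   // accurate, expensive
}

__global__ void fast_version(float *data)
{
    data[threadIdx.x] = __sinf(data[threadIdx.x]); // fast, reduced accuracy
}
// --use_fast_math effectively rewrites the first kernel into the second
// (and similarly for cosf, expf, logf, sqrtf, and division).
```

That would explain both the drop in resource usage and the wrong results in the full project: the intrinsics have much looser accuracy and a restricted argument range.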

Anybody have a clue?

Is decuda only for unix, or do I just not understand how to use it?

It’s written in Python and cross-platform. Just install Python for Win32.

Is it NOT possible to scope local variables under “for”, “if” and other compound statements to reduce local-memory usage, INSTEAD of declaring them at function scope?

I hope NVCC uses the scoping knowledge to re-use registers and avoid local memory? Any thoughts… ?

I tried that in 1.0 and it didn’t work. Not surprising, since nvcc performs really aggressive optimization, so it’s smart enough to detect scoping even without braces.

This is so sad. Appreciate if you could give a try in cuda 1.1 and let us know the result. Thank you.

btw, if you can move local variables locally under compound statements (like “if”, “while”, “for” etc…), then it is also possible to write code that re-uses just one set of function-scope local variables for all such needs. Just that readability would suffer. So the effective local usage would be the “MAX” of all the locally used local variables.

edit: So, if you are using local variables locally inside compound statements and you are sure that NVCC is ADDing them instead of UNIONing them, then moving the UNIONised local vars to function scope would reduce your local memory usage.
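A sketch of what that UNION-izing could look like (sizes, names, and the kernel itself are invented for illustration):

```cuda
// Instead of two separately scoped arrays (which the compiler may allocate
// side by side, ADDing their lmem), declare one function-scope buffer and
// reuse it in both branches: effective usage becomes the MAX, not the sum.
__global__ void unionized(int mode, float *out)
{
    float scratch[16];   // max(16, 8): big enough for either branch

    if (mode == 0) {
        for (int i = 0; i < 16; ++i)   // previously: float a[16];
            scratch[i] = 2.0f * i;
    } else {
        for (int i = 0; i < 8; ++i)    // previously: float b[8];
            scratch[i] = i + 1.0f;
    }
    out[threadIdx.x] = scratch[0];
}
```

(Arrays indexed with a variable index still land in lmem either way; the point of the union trick is only to avoid paying for both arrays at once.)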

Anyway, if NVCC scopes it all well in 1.1 (or 1.2) – it would be a boon to programmers.

Take a look at the generated PTX — the compiler issues a new virtual register for each assignment, so I believe your trick with re-using registers won’t work.

I think you are talking about the virtual registers when it comes to allocation of “local memory”. But that happens only after a register overflow.

So, if you can reduce local variable utilization by UNION-izing as I said before, you can reduce the register usage and thus try to keep “lmem” at bay.

That said, I dont really understand the virtual-register thing that you are mentioning.

So, do you mean to say:

__global__ void mykernel(void)
{
	int i, j;

	i = 5;

	i = 7; /* Does the compiler use a new register for this assignment? */
}

btw, just to know – why do you say “everything is reversible”?


Best Regards,


A kernel that doesn’t write anything to gmem gets optimized out by the compiler.

Let’s modify it a little bit:

__global__ void mykernel( int * data1, int * data2 )
{
	int i, j;

	i = 5;
	data1[threadIdx.x] = i;

	i = 7; /* Does the compiler use a new register for this assignment? */
	data2[threadIdx.x] = i;
}

This one generates this PTX code (with default NVCC options):

.entry _Z8mykernelPiS_
	{
	.reg .u16 $rh1;

	.reg .u32 $r1,$r2,$r3,$r4,$r5,$r6,$r7;

	.param .u32 __cudaparm__Z8mykernelPiS__data1;

	.param .u32 __cudaparm__Z8mykernelPiS__data2;

	.loc	13	1	0


	.loc	13	7	0

	mov.u16  $rh1, %tid.x;        	//  

	mul.wide.u16  $r1, $rh1, 4;    //  

	mov.s32  $r2, 5;              	//  

	ld.param.u32  $r3, [__cudaparm__Z8mykernelPiS__data1];	//  id:14 __cudaparm__Z8mykernelPiS__data1+0x0

	add.u32  $r4, $r3, $r1;       	//  
	st.global.s32  [$r4+0], $r2;  	//  id:15

	.loc	13	11	0

	mov.s32  $r5, 7;              	// 

	ld.param.u32  $r6, [__cudaparm__Z8mykernelPiS__data2];	//  id:16 __cudaparm__Z8mykernelPiS__data2+0x0

	add.u32  $r7, $r6, $r1;       	//  
	st.global.s32  [$r7+0], $r5;  	//  id:17

	exit;                          //  

	} // _Z8mykernelPiS_

You may see that the compiler uses two different registers — $r2 and $r5. That’s normal, nvcc is designed to act this way =) As I understand it, actual register allocation is performed by ptxas. And ptxas doesn’t know about scoping because it works on .ptx files, not on .cu ones.

It’s about reverse engineering programs… I’ve been doing this for quite some time.

That’s definitely enlightening to me. Thanks for that! I did not know that. So the register is locked (for the compiler) between consecutive assignments.

This could probably be to avoid “pipeline” hazards OR to facilitate out-of-order execution. Nice! That makes sense now.

Now, only an NVCC author can tell us if the compiler scopes the local variables declared under “if” , “for” and other compound statements.

Thank you!!

And, Aah… Reverse eng… Ssssh… ok…

Best Regards,

I have a big kernel:

.lmem 1488

.smem 12828

.reg 47

and in disasm it takes 1940 lines. I have two questions.

  1. I am using 8x8 blocks, so there should be close to 128 registers available per thread, am I right? Can I increase register usage?

(In the asm code I can find something like “movseg?8.u32 l[$r120], $r14”, so I think I can not.)

  2. It is quite difficult to read 1940 lines of asm. My kernel has a lot of device function “calls”. I can write some test kernels which will “call” only one device function each.

So, will it give the same (or a similar) result to disassemble the device functions all together in the big kernel as to disassemble each one in its own test kernel?
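On the 8x8 question above, a back-of-envelope calculation (assuming compute 1.0/1.1 limits as I remember them from the programming guide: 8192 registers and 16 KB shared memory per multiprocessor):

```cuda
// Occupancy arithmetic for an 8x8 block on compute 1.0/1.1 (assumed limits):
//   8x8 block = 64 threads.
//   Registers: 8192 regs/SM / 64 threads = 128 regs/thread if exactly one
//              block is resident per SM.
//   With .reg 47: 64 * 47 = 3008 regs/block, so the register file alone
//              would allow floor(8192 / 3008) = 2 blocks per SM.
//   Shared memory: 12828 bytes/block vs 16384 bytes/SM -> only 1 block
//              fits, so smem (not registers) is the occupancy limiter here.
```

So yes, with one block per SM the hard cap is around 128 registers per thread, and in this kernel shared memory already forces one block per SM.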