How to prevent nvcc from using local memory?

AndreiB · December 6, 2007, 11:41am

Hello,

I have pretty heavy kernel with relatively high resource usage:

lmem = 704

smem = 36

reg = 124

After inserting some if()'s I’ve managed to reduce lmem and registers usage (yes, this still works in 1.1…):

lmem = 512

smem = 36

reg = 83

Okay, here’s my question: is there any way to force nvcc not to use local memory? I’m not using arrays with variable index in my kernel. I’m using CUDA 1.1.

DenisR · February 4, 2008, 9:25am

I have a kernel that uses 84 bytes lmem, 280 bytes smem, 31 registers. I also do not use any indexing into memory that would explain the use of local memory.

Any way to find out exactly which variables get placed in local memory? And to try to remove it, I just put if() into my kernel?

wumpus · February 4, 2008, 9:59am

With decuda you could see what gets placed in local memory. But I think the only way of using less local memory would be to reduce register usage, as you are exceeding the total register capactity; lmem = 512 + reg = 83 that’s 211 “registers” in total.

AndreiB · February 4, 2008, 10:48am

Can you please explain this in more detail?

Sarnath · February 4, 2008, 11:54am

I have seen that local-memory is far far slower than global memory.
So, I think you will be better-off having per-thread global memory arrays created and index them with your effective threadId.

DISCLAIMER: I disclaim my theory about local mem. I just came across this once. I am scared of using lcoal memory which can zap out your performance. Just my view. Thats all

DenisR · February 4, 2008, 12:55pm

But in my case I have 84 / 4 = 21 ‘registers’ in local memory and 31 real registers. That would mean 52 registers and that would be fitting. I have some trouble by the way to correlate the output of decuda and my kernel code. Any hints to make it more easy to correlate the two?

Adding syncthreads might be a possibility I guess and then count how many have passed before I see references to l[$ofs2+0x00000000]

I see btw. this in the ptx :

.local .align 4 .b8 __cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda___cuda_result16212212212212212212212212212[28];

	.local .align 4 .b8 __cuda___cuda___cuda___cuda___cuda___cuda___cuda_result48240240240240240240[28];

	.local .align 4 .b8 __cuda___cuda___cuda_result80268268[28];

And I must say that the number 28 is not really something that I can relate to my kernel.

DenisR · February 5, 2008, 10:21am

Some more info:

I previously had 2 kernels :
56 bytes lmem, 228 bytes smem, 20 registers
100 bytes lmem, 168 bytes smem, 29 registers

Merged them to prevent reading & writing to global memory and got:
84 bytes lmem, 296 bytes smem, 31 registers

That I already found quite strange, since I would guess that I would get at least 100 bytes lmem.

Then I took the first kernel, started to optimise it (to try and minimize the local memory) and got:
56 bytes lmem, 188 bytes smem, 16 registers
So my optimisations lowered my register usage, but lmem usage was still the same.

When looking through nvcc documentation, I encountered the --use_fast_math option. Tried it (just for fun) and got:
0 bytes lmem, 188 bytes smem, 10 registers

So apparantly using the less accurate functions is lowering my register count and local memory count?? When using the fast math option on my complete project I get wrong results, so it is not really an option. And I also do not understand the reason for this big change in lmem & register usage.

Anybody have a clue?

FROL · February 13, 2008, 7:56pm

decuda is only for unix or i dont understand how to use it?

AndreiB · February 14, 2008, 6:34am

It’s written in Python and cross-platform. Just install Python for Win32.

Sarnath · February 14, 2008, 7:21am

Is it NOT possible to scope local-variables under “for” , “if” and other compound statements to reduce the usage of local-memory INSTEAD of declaring them under the global function ?

I hope NVCC uses the scoping knowledge to re-use registers and avoid local memory? Any thoughts… ?

AndreiB · February 14, 2008, 7:57am

I tried that in 1.0 and it didn’t work. Not surprisingly since nvcc performs really agressive optimization, so it’s smart enough to detect scoping without braces.

Sarnath · February 14, 2008, 8:06am

This is so sad. Appreciate if you could give a try in cuda 1.1 and let us know the result. Thank you.

btw, if you can move local-variable locally under compound statements (like “if”, “while”, “for” etc…) then it is also possible to write code to re-use just one set of global local variables for all such needs. Just that readability would become bad. So, the effective local usage would be the “MAX” of all the locally used local variables.

edit: So, if you are using local variables locally inside compound statements and if you are sure that NVCC is ADDing them instead of UNIONing them then moving the UNIONised local-vars global to the function would reduce your local memory usage.

Anyway, if NVCC scopes it all well in 1.1 (or 1.2) – it would be a boon to programmers.

AndreiB · February 14, 2008, 8:39am

Take a look at generated PTX — compiler issues new virtual register for each assignment, so I believe your trick with re-using registers won’t work.

Sarnath · February 14, 2008, 9:28am

I think you are talking about the virtual register when it come to allocation of “local memory”. But that happens only after a register-overflow.

So, if you can reduce the local variable utilization by UNION-izing as I said before, you can reduce the register-usage and thus can try to keep “lmem” at the bay.

That said, I dont really understand the virtual-register thing that you are mentioning.

So, Do you mean to say:

global void mykernel(void)

{

int i, j;

i= 5;

i= 7; /* Does the compiler use a new register for this assignment? */

}

btw, jusss 2 kno – Why do you say “everything is reversible”?

Thanks,

Best Regards,

Sarnath

AndreiB · February 14, 2008, 10:07am

Kernel without writing something to gmem gets optimized out by compiler.

Let’s modify it a little bit:

__global__ void mykernel( int * data1, int * data2 )

{

int i, j;

i= 5;

data1[threadIdx.x] = i;

i= 7; /* Does the compiler use a new register for this assignment? */

data2[threadIdx.x] = i;

}

This one generates this PTX code (with default NVCC options):

.entry _Z8mykernelPiS_

	{

	.reg .u16 $rh1;

	.reg .u32 $r1,$r2,$r3,$r4,$r5,$r6,$r7;

	.param .u32 __cudaparm__Z8mykernelPiS__data1;

	.param .u32 __cudaparm__Z8mykernelPiS__data2;

	.loc	13	1	0

$LBB1__Z8mykernelPiS_:

	.loc	13	7	0

	mov.u16  $rh1, %tid.x;        	//  

	mul.wide.u16  $r1, $rh1, 4;    //  

	mov.s32  $r2, 5;              	//  

	ld.param.u32  $r3, [__cudaparm__Z8mykernelPiS__data1];	//  id:14 __cudaparm__Z8mykernelPiS__data1+0x0

	add.u32  $r4, $r3, $r1;        //  

	st.global.s32  [$r4+0], $r2;  	//  id:15

	.loc	13	11	0

	mov.s32  $r5, 7;              	// 

	ld.param.u32  $r6, [__cudaparm__Z8mykernelPiS__data2];	//  id:16 __cudaparm__Z8mykernelPiS__data2+0x0

	add.u32  $r7, $r6, $r1;        //  

	st.global.s32  [$r7+0], $r5;  	//  id:17

	exit;                          //  

	} // _Z8mykernelPiS_

You may see that compiler uses two different registers — $r2 and $r5. That’s normal, nvcc is designed to act this way =) As I understand actual register allocation is performed by ptxas. And ptxas doesn’t know about scoping because it works on .ptx files, not on .cu ones.

It’s about reverse engineering programs… I’ve been doing this for quite some time.

Sarnath · February 14, 2008, 11:49am

Thats definitely enlightening to me. Thanks for that! I did not know that. So the register is locked (for the compiler) between consecutive assignments.

This could be probably to avoid “pipeline” hazards OR to facilitate out of order execution. Nice! That makes sense now.

Now, only an NVCC author can tell us if the compiler scopes the local variables declared under “if” , “for” and other compound statements.

Thank you!!

And, Aah… Reverse eng… Ssssh… ok…

Best Regards,
Sarnath

FROL · February 14, 2008, 4:18pm

i have a big kernel

.lmem 1488

.smem 12828

.reg 47

and in disasm it took 1940 lines. I have two questions.

i using 8x8 blocks so, it should be near to 128 registers in usage, am i right? can i increse register usage?

(in asm code i can find some-thing like that “movseg?8.u32 l[$r120], $r14”, so i think - i can not)

it is quite difficult to read 1940 lines of asm. My kernel have a lot of device functions “calls”. i can write some test kernels which will “call” only one device function.

so, will it be the same(or like) result - to disasm device functions all together in big kernel and disasm them in each test kernel ?

Topic		Replies	Views
Forcing reduced register usage? CUDA Programming and Performance	6	1618	April 24, 2010
Optimizing ptx CUDA Programming and Performance	10	9054	April 24, 2008
Global memory vs register storage How to force the compiler to use registers? CUDA Programming and Performance	6	4998	July 3, 2009
Increasing Register Usage Haven't seen a good discussion on this since CUDA 1.0... CUDA Programming and Performance	12	9061	August 20, 2009
Some questions about PTX and cubin. How to know where my registers are used for? CUDA Programming and Performance	3	15972	July 9, 2010
Local variables and registers CUDA Programming and Performance	13	6196	March 23, 2010
NVCC chooses to use local memory while there is a lot of registers it can use CUDA Programming and Performance	10	1499	January 7, 2022
Use of register An odd problem CUDA Programming and Performance	12	2301	August 12, 2010
Register usage Understanding -ptx and -cubin CUDA Programming and Performance	11	5375	July 24, 2007
why my kernel uses local memory? CUDA Programming and Performance	9	3661	August 21, 2015

How to prevent nvcc from using local memory?

Related topics