CUDA __local__ force

santyhyammer · June 22, 2008, 5:00am

Is there any way to force local memory? If I put __device__ float4 myLocalVar the compiler complains… I need a way to control whether a variable goes in registers or in device local memory… I could use global device memory and index it… but that would be harder to program…

thx

SPWorley · June 22, 2008, 6:46am

Why do you want to force local memory? It’s inferior to registers in almost every way.

The most common question is the reverse, “how can I stop the compiler from making my routines use local memory?”

But with regard to your question,

__device__ float4 myVar;

will work at FILE scope, but it is not supported at FUNCTION scope.

I think this is what you mean by your last sentence. But this is still straightforward to use by using the file-scoped variable only in your one function. You don’t need to worry about the potential complication of recursion, since recursion isn’t supported anyway.

__shared__ float4 LocalForFoo;  // file scope; __shared__ gives one copy per block, not per thread

__device__ int Foo(int bar)
{
    // use LocalForFoo here
    return bar;
}
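For the "global device memory and index it" variant mentioned in the question, a minimal sketch might look like the following. The names gScratch, sumViaScratch, scratchKernel and the size NUM_THREADS are hypothetical and not from this thread; the sketch assumes a single block of NUM_THREADS threads so that each thread owns exactly one slot.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_THREADS 256  // hypothetical launch size, one block only

// File-scope __device__ array: lives in global memory, one slot per thread.
__device__ float4 gScratch[NUM_THREADS];

// Device function that uses the file-scope array as its private scratch space.
__device__ float sumViaScratch(float x)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    gScratch[tid] = make_float4(x, 2.0f * x, 3.0f * x, 4.0f * x);
    float4 v = gScratch[tid];
    return v.x + v.y + v.z + v.w;  // equals 10 * x
}

__global__ void scratchKernel(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = sumViaScratch((float)tid);
}

int main()
{
    float hOut[NUM_THREADS];
    float *dOut = NULL;
    cudaMalloc((void **)&dOut, sizeof(hOut));

    scratchKernel<<<1, NUM_THREADS>>>(dOut);
    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dOut);

    // Each thread should have written 10 * tid.
    int errors = 0;
    for (int i = 0; i < NUM_THREADS; ++i)
        if (hOut[i] != 10.0f * i) ++errors;
    printf("%s (%d mismatches)\n", errors == 0 ? "ok" : "FAILED", errors);
    return errors != 0;
}
```

Indexing by thread id means consecutive threads touch consecutive float4 slots, which is the manual bookkeeping the question was hoping to avoid.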

santyhyammer · June 22, 2008, 3:24pm

I need both things :D… to stop the compiler from using local memory to save registers, and also to force local memory, because it is easier to program than global indexing (and it is easier to avoid coalescing problems).

wumpus · June 25, 2008, 12:21pm

If you define an array locally in your kernel and address it dynamically, it will always be placed in local memory (as registers don't support dynamic addressing).
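A minimal sketch of that pattern, with hypothetical names (dynamicIndexKernel, scratch, indices) and arbitrary sizes. Whether a particular compiler version really places scratch in local memory is best confirmed by compiling with nvcc -Xptxas -v, which prints register and local memory usage per kernel; an optimizer may be able to remove the array in a toy case like this one.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: a per-thread array indexed with a value that is only
// known at run time, so it cannot simply be mapped onto named registers.
__global__ void dynamicIndexKernel(const int *indices, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float scratch[16];                 // declared locally in the kernel
    for (int i = 0; i < 16; ++i)
        scratch[i] = (float)(tid + i);

    int k = indices[tid] & 15;         // run-time index, kept in range 0..15
    out[tid] = scratch[k];             // dynamic addressing
}

int main()
{
    const int n = 256;
    int hIdx[n];
    float hOut[n];
    for (int i = 0; i < n; ++i) hIdx[i] = i % 16;

    int *dIdx = NULL;
    float *dOut = NULL;
    cudaMalloc((void **)&dIdx, n * sizeof(int));
    cudaMalloc((void **)&dOut, n * sizeof(float));
    cudaMemcpy(dIdx, hIdx, n * sizeof(int), cudaMemcpyHostToDevice);

    dynamicIndexKernel<<<1, n>>>(dIdx, dOut, n);
    cudaMemcpy(hOut, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIdx);
    cudaFree(dOut);

    // Thread i reads scratch[i % 16], which holds i + (i % 16).
    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != (float)(i + i % 16)) ++errors;
    printf("%s (%d mismatches)\n", errors == 0 ? "ok" : "FAILED", errors);
    return errors != 0;
}
```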

santyhyammer · June 25, 2008, 3:29pm

That's good advice, thx! That partially solves the "hey Mr. compiler, stop using registers where I don't want them"… but the "hey Mr. compiler, stop putting the things I want in registers into device local memory" still needs a solution.

For example, I have this structure:

struct Foo
{
    float4 a, b, c, d;
};

__global__ void myKernel()
{
    Foo myFooStruct;
    myFooStruct.a = tex1Dfetch(blah blah);
}

The nvcc compiler puts the Foo structure in local memory to save registers… resulting in very very slow performance. Even if I separate the kernel into

__global__ void myKernel()
{
    float4 a, b, c, d;
    a = tex1Dfetch(blah blah);
    b = tex1Dfetch(blah blah);
    ....
}

It moves some of the float4 components (not all) into device local memory to save registers… The problem is that I cannot use more than 256 threads per block anyway, yet the compiler is still trying to minimize register usage (currently 24, which is more than enough for 256 threads… the kernel is limited by the amount of shared memory, not by registers)… so it hurts the speed a lot… I think a new pragma (forceregister) is needed…
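A sketch of how one might inspect and influence this, under several assumptions: plain global loads stand in for the elided tex1Dfetch calls, fooKernel and the sizes are hypothetical, and the toolkit is assumed to be one that supports the __launch_bounds__ qualifier. Compiling with nvcc -Xptxas -v prints the register and local memory usage of each kernel, and nvcc's -maxrregcount option caps registers per thread. Neither is a "force register" switch; they only change the budget the compiler works with.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct Foo
{
    float4 a, b, c, d;
};

// __launch_bounds__(256) promises the compiler that blocks never exceed
// 256 threads, so it can plan register usage around that limit.
__global__ void __launch_bounds__(256)
fooKernel(const float4 *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    Foo f;                      // only accessed through fixed member names
    f.a = in[4 * tid + 0];      // stand-in for the elided tex1Dfetch calls
    f.b = in[4 * tid + 1];
    f.c = in[4 * tid + 2];
    f.d = in[4 * tid + 3];

    out[tid] = f.a.x + f.b.y + f.c.z + f.d.w;
}

int main()
{
    const int n = 256;
    static float4 hIn[4 * n];
    static float hOut[n];
    for (int j = 0; j < 4 * n; ++j) {
        float v = (float)j;
        hIn[j] = make_float4(v, v, v, v);
    }

    float4 *dIn = NULL;
    float *dOut = NULL;
    cudaMalloc((void **)&dIn, sizeof(hIn));
    cudaMalloc((void **)&dOut, sizeof(hOut));
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);

    fooKernel<<<1, n>>>(dIn, dOut, n);
    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    // in[j] holds j in every component, so thread i sums
    // 4i + (4i+1) + (4i+2) + (4i+3) = 16i + 6.
    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != 16.0f * i + 6.0f) ++errors;
    printf("%s (%d mismatches)\n", errors == 0 ? "ok" : "FAILED", errors);
    return errors != 0;
}
```

If the ptxas output still reports local memory for a kernel like this, the numbers at least show how many bytes were spilled and how that changes as the launch bound or register cap is adjusted.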

Reimar · June 26, 2008, 6:21am

Have you tried the standard C keyword for that, “register”? I certainly do not want NVidia to reinvent existing C functionality…

santyhyammer · June 26, 2008, 1:58pm

Yep, I tried but it does nothing.
