CUDA __local__ force

Is there any way to force a variable into local memory? If I declare __device__ float4 myLocalVar inside a function, the compiler complains… I need a way to control whether a variable goes into registers or device local memory… I could use a global __device__ array and index it… but that would be harder to program…


Why do you want to force local memory? It’s inferior to registers in almost every way.

The most common question is the reverse, “how can I stop the compiler from making my routines use local memory?”

But, with regards to your question,

__device__ float4 myVar;

will work at FILE scope, but it is not supported at FUNCTION scope.

I think this is what you mean by your last sentence. But this is still straightforward to use by using the file-scoped variable only in your one function. You don’t need to worry about the potential complication of recursion, since recursion isn’t supported anyway.

__shared__ float4 LocalForFoo;

__device__ int Foo(int bar)
{
    // use LocalForFoo here
}

I need both things :D… to stop the compiler from using local memory to save registers, and also to force local memory, because it is easier to program than global indexing (and you can avoid coalescing problems more easily).

If you define an array locally in your kernel and address it dynamically, it will always be placed in local memory (as registers don’t support dynamic addressing).
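A small sketch of that distinction, assuming nothing beyond the standard CUDA runtime (the kernel and helper names are invented for illustration). In the first kernel the index is only known at run time, so the array would typically end up in local memory. In the second, every index is a compile-time constant after unrolling, so the compiler is free to keep the values in registers. Compiling with nvcc --ptxas-options=-v should report the local memory each kernel uses, though the exact outcome depends on the compiler version.

```cuda
#include <cuda_runtime.h>

// Dynamic index: idx[] is read at run time, so buf cannot be mapped to fixed registers.
__global__ void dynamicIndex(const int *idx, float *out)
{
    float buf[8];
    for (int i = 0; i < 8; ++i)
        buf[i] = 10.0f * i + threadIdx.x;
    out[threadIdx.x] = buf[idx[threadIdx.x] & 7];
}

// Static indices: once the loops are unrolled, each buf[i] has a constant index.
__global__ void staticIndex(float *out)
{
    float buf[8];
    #pragma unroll
    for (int i = 0; i < 8; ++i)
        buf[i] = 10.0f * i + threadIdx.x;
    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < 8; ++i)
        sum += buf[i];
    out[threadIdx.x] = sum;
}

// Host helper (illustrative): runs dynamicIndex with one thread and a given index.
float runDynamicIndex(int index)
{
    int *dIdx = nullptr;
    float *dOut = nullptr;
    float hOut = 0.0f;
    cudaMalloc(&dIdx, sizeof(int));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIdx, &index, sizeof(int), cudaMemcpyHostToDevice);
    dynamicIndex<<<1, 1>>>(dIdx, dOut);
    cudaMemcpy(&hOut, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIdx);
    cudaFree(dOut);
    return hOut;
}

// Host helper (illustrative): runs staticIndex with one thread.
float runStaticIndex()
{
    float *dOut = nullptr;
    float hOut = 0.0f;
    cudaMalloc(&dOut, sizeof(float));
    staticIndex<<<1, 1>>>(dOut);
    cudaMemcpy(&hOut, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return hOut;
}
```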

That’s good advice, thanks! It partially solves the “hey Mr. compiler, stop using registers where I don’t want them” problem… but the “hey Mr. compiler, stop putting in device local memory the things I want in registers” problem still needs a solution.

For example, I have this structure:

struct Foo
{
    float4 a, b, c, d;
};

__global__ void myKernel()
{
    Foo myFooStruct;

    myFooStruct.a = tex1Dfetch(blah blah);
}

The nvcc compiler puts the Foo structure in local memory to save registers… resulting in very slow performance. Even if I split the structure into separate variables:

__global__ void myKernel()
{
    float4 a, b, c, d;

    a = tex1Dfetch(blah blah);

    b = tex1Dfetch(blah blah);
}


It still moves some of the float4 components (not all) into device local memory to save registers… The problem is that I cannot use more than 256 threads per block anyway, yet the compiler keeps trying to reduce register usage (currently 24 registers, with plenty to spare for 256 threads; the kernel is limited by the amount of shared memory, not by registers)… so it hurts the speed a lot… I think a new pragma (forceregister) is needed…
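There is no forceregister pragma that I know of, but later CUDA toolkits do offer the __launch_bounds__ qualifier and the nvcc -maxrregcount option, which influence the register budget the compiler works with. Neither guarantees that a given variable stays in registers. The sketch below assumes a toolkit that supports __launch_bounds__, and it replaces the tex1Dfetch calls with plain global loads purely to stay self-contained; the buffer layout and the helper runLaunchBoundsExample are hypothetical.

```cuda
#include <cuda_runtime.h>

struct Foo
{
    float4 a, b, c, d;
};

// Promise the compiler that this kernel is never launched with more than
// 256 threads per block, so it may budget more registers per thread.
__global__ void __launch_bounds__(256) myKernel(const float4 *in, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    Foo f;  // members are accessed by name only, never through a run-time index
    f.a = in[4 * tid + 0];
    f.b = in[4 * tid + 1];
    f.c = in[4 * tid + 2];
    f.d = in[4 * tid + 3];

    out[tid] = f.a.x + f.b.y + f.c.z + f.d.w;
}

// Host helper (illustrative): runs one thread on in[i] = (i, i, i, i).
float runLaunchBoundsExample()
{
    float4 hIn[4];
    for (int i = 0; i < 4; ++i)
        hIn[i] = make_float4((float)i, (float)i, (float)i, (float)i);

    float4 *dIn = nullptr;
    float *dOut = nullptr;
    float hOut = 0.0f;
    cudaMalloc(&dIn, sizeof(hIn));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, hIn, sizeof(hIn), cudaMemcpyHostToDevice);
    myKernel<<<1, 1>>>(dIn, dOut);
    cudaMemcpy(&hOut, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return hOut;
}
```

Checking the output of nvcc --ptxas-options=-v before and after adding the qualifier is the simplest way to see whether it changed the register and local memory usage for a particular kernel.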

Have you tried the standard C keyword for that, “register”? I certainly do not want NVidia to reinvent existing C functionality…

Yep, I tried but it does nothing.