Register vs local memory Forcing NVCC to use registers

Dear all,

maybe this is outdated (I am still using version 0.8), but I never came across this topic in the forum.

My kernel is quite complex and needs a lot of shared memory, but only about 30 registers. This

shared memory restricts my execution to run at max. 128 threads per multiprocessor (MP).

From the number of registers per MP (8192), I know that I have up to 64 registers per thread

available in this setup.

When I have an array in my kernel like:

__global__

void myKernel (...)

{

  float fTmp[24];  // use 24 temporary floats

  ...

}

my cubin file tells me that this is allocated as local memory of size 128 bytes. In my case it

would be great if I could force NVCC to take these 24 floats of f4Tmp as registers (I have

34 registers spare!). That would give me a big speed-up since local memory is not cached

and has high latency. Is there any way to tell NVCC to do is my way?

Thanks in advance,

Jake

No. And I am pretty sure this will not work on any other processor architecture either as registers are usually not indexable.

Peter

Jake you will need to change all references to fTmp to have constant indicies then the compiler will leave fTmp in registers. Unfortunately 1.0 still does not unroll loops so if it is in a loop indexed by the loop counter you will have to unroll manually.
Eric

I see… I did not know registers are not indexable. It seems that I have to do the loop
unrollment manually. Well, since device functions are inlined anyway, I guess
my inner loop will become a function with quite some parameters…

Thanks a lot for the answers,
Jake

Would it be possible to break down an array of 24 floats into several smaller arrays? If so, I’d suggest trying that and checking if local mem usage is reduced. Please share what happens if you try this.

I assume the values in the array will change? If not, you could place it in constant memory, which is cached.

Paulius