Using shared memory as extra register space doesn't work?

I've tried using shared memory as a source of extra registers…

Assigning a “private” block of shared memory to every thread in a block and using it as an extension of register space.

Something like:

__shared__ float shValues[DIMBLOCK*4];

would hold 4 variables in shared memory instead of registers.

With no __syncthreads() whatsoever, so it wouldn't slow me down.

I figure I could run into some bank conflicts, but it'd still be faster than only running one block per MP.
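For reference, here is a stripped-down sketch of the pattern (the kernel name, the DIMBLOCK value, and the arithmetic are just placeholders for my real code). Each thread only ever touches its own slots, shValues[threadIdx.x + k*DIMBLOCK], so no synchronization is needed:

#define DIMBLOCK 256

__global__ void kernelWithSharedRegisters(float *out)
{
    // 4 private "registers" per thread, laid out so that consecutive
    // threads hit consecutive shared memory banks
    __shared__ float shValues[DIMBLOCK * 4];

    const int t = threadIdx.x;

    shValues[t + 0 * DIMBLOCK] = 1.0f;                                  // slot 0
    shValues[t + 1 * DIMBLOCK] = 2.0f;                                  // slot 1
    shValues[t + 2 * DIMBLOCK] = shValues[t] + shValues[t + DIMBLOCK];  // slot 2
    shValues[t + 3 * DIMBLOCK] = shValues[t + 2 * DIMBLOCK] * 0.5f;     // slot 3

    out[blockIdx.x * DIMBLOCK + t] = shValues[t + 3 * DIMBLOCK];
}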

But when I try to do that, the register usage actually goes up.

Does anybody know why?

–edit

Some more volatile fun.

If instead of accessing a shared-memory “register” with

shValues[threadIdx.x+2*DIMBLOCK]

I use

volatile int two = 2;

shValues[threadIdx.x+two*DIMBLOCK]

then the register usage doesn't go up or down. It's the same as if I had used a register.

So the literal “2” was kept in a register for the whole execution, while “two” is stored in some constant (hence the… volatile?!?) segment somewhere…

Which brings us back to… what the hell does volatile do?
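In case anyone wants to poke at it, this is the kind of minimal repro I'm comparing (my own cut-down sketch; the kernel names and DIMBLOCK are placeholders). Building with nvcc --ptxas-options=-v prints the register count per kernel, so you can check whether the two versions differ for you:

#define DIMBLOCK 256

__global__ void literalIndex(float *out)
{
    __shared__ float shValues[DIMBLOCK * 4];
    // index computed from the literal 2
    shValues[threadIdx.x + 2 * DIMBLOCK] = out[threadIdx.x];
    out[threadIdx.x] = shValues[threadIdx.x + 2 * DIMBLOCK] * 2.0f;
}

__global__ void volatileIndex(float *out)
{
    __shared__ float shValues[DIMBLOCK * 4];
    // same index, but the 2 goes through a volatile local
    volatile int two = 2;
    shValues[threadIdx.x + two * DIMBLOCK] = out[threadIdx.x];
    out[threadIdx.x] = shValues[threadIdx.x + two * DIMBLOCK] * 2.0f;
}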

–edit2

Some more register allocation fun!

volatile float alphamin = 0;

imin=ceilf((x1+alphamin*(x2-x1)-bx)/dx);

Repeat for y and z, and for x1<x2 and x2<x1…

This way I get 27 registers.

Stupid way to do things, right? alphamin is 0, so just skip the whole calculation, right?

Wrong, it seems.

If I do

imin=ceilf((x1-bx)/dx);

then it jumps to 30 registers…
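To be concrete, these are the two variants I'm swapping between, written out as a reduced sketch (x1, x2, bx, dx and imin stand in for the real kernel's variables; the y and z versions are analogous):

volatile float alphamin = 0.0f;
int imin;

// version A: keep the "useless" alphamin term  -> 27 registers in my kernel
imin = ceilf((x1 + alphamin * (x2 - x1) - bx) / dx);

// version B: hand-simplify it away             -> 30 registers in my kernel
imin = ceilf((x1 - bx) / dx);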

volatile on a local variable does not make much sense; better to make your shared memory array volatile.
Also, unless you are using maxrregcount and/or the compiler already spills to lmem, optimizing away one register might just mean the compiler will optimize some other code even more aggressively by using additional registers.
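For example (the architecture flag and file name are just placeholders), something like

nvcc -arch=sm_13 --maxrregcount=24 --ptxas-options=-v mykernel.cu

both caps the register count and prints the actual register/lmem/smem usage per kernel, so you can see whether the compiler is already spilling to local memory.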

In my experience, the only way to force the kernel to make better use of registers is to explicitly reuse variables.
It produces very inconvenient code; however, it works.
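Rough sketch of what I mean (my own example, reusing the names from above inside the kernel; the y variables are just the obvious analogues): instead of giving every intermediate result its own variable, recycle one temporary so fewer values are live at the same time.

float tmp;

tmp  = x1 + alphamin * (x2 - x1) - bx;   // reuse tmp for the x part...
imin = ceilf(tmp / dx);

tmp  = y1 + alphamin * (y2 - y1) - by;   // ...then recycle it for the y part
jmin = ceilf(tmp / dy);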