I've tried using shared memory as a source of extra registers…
The idea is to assign a "private" slice of shared memory to every thread in a block and use it as an extension of the register space.
Something like:
__shared__ float shValues[DIMBLOCK*4];
would hold 4 variables per thread in shared memory instead of in registers.
No __syncthreads() whatsoever, since each thread only touches its own slots, so it wouldn't slow me down.
I figure I could run into some bank conflicts, but it'd still be faster than only running one block per MP.
But when I try to do that, the register usage actually goes up.
Does anybody know why?
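For reference, here is a minimal sketch of the layout I mean. It is not my real kernel: the kernel name, the value of DIMBLOCK and the arithmetic are made up just to show the indexing, where slot k of thread t lives at shValues[t + k*DIMBLOCK].

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define DIMBLOCK 128  // threads per block (assumed value for this sketch)

// Hypothetical kernel: every thread keeps 4 floats in its own shared-memory
// slots instead of in registers. Slot k of thread t is shValues[t + k*DIMBLOCK].
__global__ void privateSlotsKernel(const float *in, float *out, int n)
{
    __shared__ float shValues[DIMBLOCK * 4];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    if (gid >= n) return;  // safe here: there is no __syncthreads() below

    float x = in[gid];
    shValues[tid + 0 * DIMBLOCK] = x;
    shValues[tid + 1 * DIMBLOCK] = 2.0f * x;
    shValues[tid + 2 * DIMBLOCK] = 3.0f * x;
    shValues[tid + 3 * DIMBLOCK] = 4.0f * x;

    // Each thread reads back only the slots it wrote itself,
    // so no barrier between the writes and the reads is needed.
    out[gid] = shValues[tid + 0 * DIMBLOCK] + shValues[tid + 1 * DIMBLOCK]
             + shValues[tid + 2 * DIMBLOCK] + shValues[tid + 3 * DIMBLOCK];
}

int main()
{
    const int n = 1000;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + DIMBLOCK - 1) / DIMBLOCK;
    privateSlotsKernel<<<blocks, DIMBLOCK>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Every output should be x + 2x + 3x + 4x = 10x.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != 10.0f * h_in[i]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches != 0;
}
```

With this layout, consecutive threads hit consecutive 32-bit words for any given slot, so the accesses should be free of bank conflicts. Compiling with nvcc --ptxas-options=-v prints the register and shared memory usage per kernel, which is where the register counts in this post come from.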
–edit
Some more volatile fun.
If, instead of accessing a shared "register" with
shValues[threadIdx.x+2*DIMBLOCK]
I use
volatile int two = 2;
shValues[threadIdx.x+two*DIMBLOCK]
then the register usage doesn't go up, or down. It's the same as if I had used a register.
So my guess is that the "2" was kept in a register for the whole execution, while the "two" is stored in some constant (hence the… volatile?!?) segment somewhere…
Which brings us back to… what the hell does volatile do?
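To make the comparison reproducible, the two variants can be sketched as two otherwise identical kernels. The kernel names and the trivial bodies are hypothetical, and kernels this small may well show no difference at all; the point is only the mechanics of comparing register counts, here via cudaFuncGetAttributes at runtime.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define DIMBLOCK 128  // threads per block (assumed value for this sketch)

// Variant 1: the slot index uses the literal 2.
__global__ void literalIndexKernel(float *out)
{
    __shared__ float shValues[DIMBLOCK * 4];
    shValues[threadIdx.x + 2 * DIMBLOCK] = (float)threadIdx.x;
    out[threadIdx.x] = shValues[threadIdx.x + 2 * DIMBLOCK] + 1.0f;
}

// Variant 2: the same index, but the multiplier is read from a volatile
// local, so the compiler may not fold it into a compile-time constant.
__global__ void volatileIndexKernel(float *out)
{
    __shared__ float shValues[DIMBLOCK * 4];
    volatile int two = 2;
    shValues[threadIdx.x + two * DIMBLOCK] = (float)threadIdx.x;
    out[threadIdx.x] = shValues[threadIdx.x + two * DIMBLOCK] + 1.0f;
}

int main()
{
    // Ask the runtime how many registers per thread each kernel was given.
    cudaFuncAttributes attrLiteral, attrVolatile;
    cudaFuncGetAttributes(&attrLiteral, literalIndexKernel);
    cudaFuncGetAttributes(&attrVolatile, volatileIndexKernel);
    printf("literal index:  %d registers\n", attrLiteral.numRegs);
    printf("volatile index: %d registers\n", attrVolatile.numRegs);

    // Run both kernels on one block and check that they compute the same thing.
    float *d_a = 0, *d_b = 0;
    float h_a[DIMBLOCK], h_b[DIMBLOCK];
    cudaMalloc(&d_a, DIMBLOCK * sizeof(float));
    cudaMalloc(&d_b, DIMBLOCK * sizeof(float));
    literalIndexKernel<<<1, DIMBLOCK>>>(d_a);
    volatileIndexKernel<<<1, DIMBLOCK>>>(d_b);
    cudaMemcpy(h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, d_b, sizeof(h_b), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < DIMBLOCK; ++i)
        if (h_a[i] != h_b[i]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_a);
    cudaFree(d_b);
    return mismatches != 0;
}
```

The register counts depend on the compiler version and target architecture, so the numbers printed here are only meaningful relative to each other on the same build.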
–edit2
Some more register allocation fun!
volatile float alphamin = 0;
imin=ceilf((x1+alphamin*(x2-x1)-bx)/dx);
Repeat for y and z, and for x1<x2 and x2<x1…
This way I get 27 registers.
Stupid way to do things, right? alphamin is 0, so just drop the whole alphamin term, right?
Wrong, it seems.
If I do
imin=ceilf((x1-bx)/dx);
then the register usage goes up to 30…