I was trying to optimise a larger program, but I ended up testing a few lines of code and noticed that CUDA was using more registers than needed.
For example, this uses 2 registers:
float varA;
varA = input[threadIdx.x].value;
output[threadIdx.x].value = varA * 2;
Now if I declare varA as __shared__, the program requires 3 registers. I don’t understand this: the address of the shared variable should be a constant encoded in the instruction, so it should not need another register to work out the address. The register count shouldn’t change, since the only difference is an extra read from shared memory into a temporary register.
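The code for this 3-register case is not shown above. A minimal sketch of what it presumably looks like, assuming the only change is the `__shared__` qualifier on varA; the `Elem` struct and the kernel name `scaleSharedScalar` are hypothetical stand-ins, since only the `.value` member is known:

```cuda
// Hypothetical element type; only the `.value` member appears in the post.
struct Elem { float value; };

// Hypothetical kernel name; assumed form of the 3-register case.
__global__ void scaleSharedScalar(const Elem *input, Elem *output)
{
    // A single variable shared by every thread in the block.
    __shared__ float varA;

    // All threads write the same location, so this races when blockDim.x > 1.
    varA = input[threadIdx.x].value;
    output[threadIdx.x].value = varA * 2;
}
```

Because every thread writes the same shared location, this form is only useful as a register-count experiment, not as working code for more than one thread per block.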
If I use the shared memory properly (with an array of elements for each thread) then it costs 4 registers to access it:
__shared__ float varA[SHARESIZE];
varA[threadIdx.x] = input[threadIdx.x].value;
output[threadIdx.x].value = varA[threadIdx.x] * 2;
I can understand the need for an extra register to compute the address IF it wasn’t constant… but then what was the initial extra register for? Why go from 3 to 4 if that 3rd register is already there to hold the address of the variable?
Another issue is that more registers are used when they shouldn’t be:
__shared__ float varA[SHARESIZE];
varA[threadIdx.x] = input[threadIdx.x].value;
output[threadIdx.x].value = varA[threadIdx.x] * 2;
varA[threadIdx.x] = input[threadIdx.x].value;
output[threadIdx.x].value = varA[threadIdx.x] * 2;
Now this costs 5 registers. Again, I don’t understand this: each value is used directly after it is read, so the thread will stall until it’s ready. In theory the registers used in the first block should be reused in the second block, so why the need for the extra register? Even if I place a __syncthreads() between them, to force the first block to finish, it still compiles to 5 registers.
I’m getting the register count from the cubin file the compiler outputs.
Also if I force the compiler to use 4 registers it places the 5th into local memory. :(
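For anyone wanting to reproduce these counts, here is a minimal sketch of a complete test file for the doubled fragment with the __syncthreads() in between. The file name, the `Elem` struct, the kernel name `scaleTwice` and the value of SHARESIZE are all assumptions, as the post does not give them. The compile lines in the comments use nvcc's `--ptxas-options=-v` (print per-kernel resource usage) and `--maxrregcount` (cap registers per thread); it is also an assumption that the 4-register cap was applied this way. Counts depend on the compiler version and target architecture, and newer toolchains may raise or ignore a cap as low as 4.

```cuda
// regtest.cu (hypothetical file name)
// Print per-kernel register usage:
//   nvcc --ptxas-options=-v -c regtest.cu
// Cap registers per thread to provoke spilling to local memory:
//   nvcc --ptxas-options=-v --maxrregcount=4 -c regtest.cu

#define SHARESIZE 64  // assumed threads per block; the real value is not given

// Hypothetical element type; only the `.value` member appears in the post.
struct Elem { float value; };

// Hypothetical kernel name. Launch with blockDim.x <= SHARESIZE.
__global__ void scaleTwice(const Elem *input, Elem *output)
{
    __shared__ float varA[SHARESIZE];  // one slot per thread

    // First block: stage the value in shared memory, then use it.
    varA[threadIdx.x] = input[threadIdx.x].value;
    output[threadIdx.x].value = varA[threadIdx.x] * 2;

    // Barrier between the two blocks, as tried in the post.
    __syncthreads();

    // Second block: identical to the first.
    varA[threadIdx.x] = input[threadIdx.x].value;
    output[threadIdx.x].value = varA[threadIdx.x] * 2;
}
```

Compiling with and without the cap and comparing the reported register and local memory figures shows whether the extra register is being spilled.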
Any ideas on how to reduce register usage? I was trying to place my variables into shared memory across a block, but if it costs twice as many registers to do so I may as well not bother. I was actually hoping that moving variables into shared memory would reduce register usage, not increase it!