This is a follow-up to my last post, and I am getting really puzzled by the CUDA compiler's behavior.
First case: I have a couple of calculations of the following type:
var_x -= var_y * var_z;
Now, this ‘-=’ operation uses one more register than the usual arithmetic operation. I have 27 such operations one after the other, and I am using a crazy number of registers (38). I moved whatever variables I could into shared memory and tried different combinations, but I am unable to bring the register count down.
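To make the pattern concrete, here is a minimal sketch of what I mean. It is not my real kernel: the kernel name, the float types, the array arguments, and the file name sketch.cu are all placeholders I made up for illustration, and it shows only three of the updates.

```cuda
// Hypothetical reduction of the pattern; names and types are placeholders,
// not taken from the real kernel.
// To see the per-kernel register count, compile with:
//   nvcc -c --ptxas-options=-v sketch.cu
__global__ void chained_updates(const float *y, const float *z, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x0 = out[i];
    float x1 = x0 + 1.0f;
    float x2 = x0 + 2.0f;

    // Same shape as var_x -= var_y * var_z, repeated back to back.
    x0 -= y[i] * z[i];
    x1 -= x0 * z[i];
    x2 -= x1 * y[i];

    // Write a result that depends on every update so none is optimized away.
    out[i] = x0 + x1 + x2;
}
```

The register count I quote comes from the ptxas verbose output, so the number for this toy version will not match the one for the full kernel.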
Is there some trick for this kind of arithmetic operation? I can post my kernel if it helps.