Since the modulo operator is very expensive on GPUs, I’m wondering if replacing this:
if (threadIdx.x % nextInterval == 0) // …
by this (I know that this will only work in a few cases, but that’s not my point):
if ((threadIdx.x & (nextInterval-1)) == 0) // …
is faster?
Also I was wondering if:
x <<= 1;
is faster than:
x *= 2;
Yes, in both cases the bitwise version is faster, just as it would be on any other processor.
The first example will only work if nextInterval is a power of 2, of course, but within that constraint it is a useful optimisation.
For the second example, any decent compiler will generate identical code for i*2 and i<<1.
I checked this briefly with nvcc 2.1, and at the PTX level, this optimization is not performed. I don’t know if ptxas performs this optimization because I don’t have decuda handy.
Perhaps if the cost of a multiply and a shift is the same then there is no need for the optimization. One tends to assume that a multiply will be more expensive on a conventional CPU but this may not be true on a GPU core.
It is using the mul.lo.s32 instruction, which, according to the programming guide, is actually four times slower than a bitwise operation. If ptxas isn’t performing this optimization, it is a pretty obvious one for nvcc to be doing.
Oh dear - I guess we all get spoiled using decent optimizing compilers these days - time to dust off all those low level C programming tricks again then…
Only if you are compute bound. I have cycles left to burn in 99% of my kernels.
Well if those cycles are going spare I could certainly use some. ;-)