Two questions (I'm a newbie)

Since the modulo operator is very expensive on GPUs:

I’m wondering if replacing this:

if (threadIdx.x % nextInterval == 0) // …

by this (I know that this will work only in a few cases, but that’s not my point):

if ((threadIdx.x & (nextInterval-1)) == 0) // …

is faster?

Also I was wondering if:

x <<= 1;

is faster than:

x *= 2;

Yes, both of them are faster, just as they would be on any processor.

The first example will only work if nextInterval is a power of 2, of course, but within that restriction it is a useful optimisation.

For the second example any decent compiler will generate identical code for i*2 and i<<1.

I checked this briefly with nvcc 2.1, and at the PTX level, this optimization is not performed. I don’t know if ptxas performs this optimization because I don’t have decuda handy.

Perhaps if the cost of a multiply and a shift is the same then there is no need for the optimization. One tends to assume that a multiply will be more expensive on a conventional CPU but this may not be true on a GPU core.

It is using the mul.lo.s32 instruction, which, according to the programming guide, is actually 4 times slower than a bitwise operation. If ptxas isn’t performing this optimization, this is a pretty obvious thing that nvcc should be doing.

Oh dear - I guess we all get spoiled using decent optimizing compilers these days - time to dust off all those low level C programming tricks again then…

Only if you are compute bound. I have cycles left to burn in 99% of my kernels.

Well if those cycles are going spare I could certainly use some. ;-)

You make me jealous :D