We know that an integer multiply by 2^n can be replaced by a shift operation, which should be faster.
As shown in versions 1 and 2, I replaced tid*4 with tid<<2 (I expected this to be a speedup). But the actual result shows that version 1 is faster than version 2. I have tested this many times.
I am confused.
Why is the shift slower than the integer multiply? My GPU is a G260.
You might want to look at the assembly generated with decuda.
Three things to consider:
There is a MAD instruction which can compute a * b + c (the multiply is 24x24-bit only on SM 1.x devices), but an instruction doing (a >> b) + c only exists on SM 2.0 devices (Fermi).
SM 1.x devices have special registers and instructions to deal with addresses.
At the hardware level, addresses are in bytes.
So the compiler has to multiply your addresses by 4 (or shift by 2) anyways.
It is likely that in the first case,
des[tid * 4 + 3] = …;
is first turned into a store at byte [des + (tid * 4 + 3) * 4],
then optimized as [des + tid * 16 + 12],
then finally as [des + (tid << 4) + 12].
I suspect what happens is that by using an unusual idiom, you prevent the compiler from inferring what you mean and from applying distributivity to fuse both multiplications/shifts together.
So just stick with multiplication…
Rest assured that if a simple and widely-known optimization can be applied safely and has a benefit in every case, the compiler will use it…
See TABLE III of Demystifying GPU Microarchitecture through Microbenchmarking
The throughput of shl/shr is 7.9 ops/clock with 24 cycles of latency, while integer mul’s is 1.7 ops/clock with 96 cycles of latency - there’s no way a mul is faster. It had to be optimized by the compiler anyway, as Sylvain said.
(Also note the paradox: floating point mul’s throughput is 11.2 so float mul IS kind of faster than bitwise functions)
It can be faster, but only under the following conditions:
The compiler can ensure that both operands fit into 24 bits.
The mul is followed by an add.
We’re running on a SM 1.x device.
Strangely enough, they all seem to apply in the OP case.
Float mul being faster than 32-bit int mul (on SM 1.x) makes sense: a float multiplication only requires a 24x24-bit multiplier, which is roughly half the size of a 32x32-bit multiplier…
If I use “__mul24( a, b ) - c”, can it also be optimized by the compiler? (int a, b, c)
Are “__mul24( a, b ) + c” and “a*b + c” equally fast once the compiler optimizes them? (int a, b, c)
I don’t think there’s a good answer to that. The part of the compiler which does this is proprietary, and nobody outside NVIDIA knows exactly how it works. I could make a dummy kernel with that single operation and see if it gets optimized, but that’s hardly conclusive - the compiler may decide otherwise in other cases.