# why shift is slower than integer multiply

We know that an integer multiply by 2^n can be replaced by a shift operation, which is supposedly faster.

As shown in versions 1 and 2 below, I replaced `tid * 4` with `tid << 2`, expecting a speedup. But the actual result is that version 1 is faster than version 2. I have tested this many times.

I am confused.

Why is the shift slower than the integer multiply? My GPU is a G260.

uint tid

version 1

```cpp
des[tid * 4 + 0] = …;
des[tid * 4 + 1] = …;
des[tid * 4 + 2] = …;
des[tid * 4 + 3] = …;

float4 d;
d.x = __fdividef(des[tid * 4 + 0], len);
d.y = __fdividef(des[tid * 4 + 1], len);
d.z = __fdividef(des[tid * 4 + 2], len);
d.w = __fdividef(des[tid * 4 + 3], len);
d_des[idx * 16 + tid] = d;
```

version 2

```cpp
des[(tid << 2)] = …;
des[(tid << 2) + 1] = …;
des[(tid << 2) + 2] = …;
des[(tid << 2) + 3] = …;

float4 d;
d.x = __fdividef(des[(tid << 2)], len);
d.y = __fdividef(des[(tid << 2) + 1], len);
d.z = __fdividef(des[(tid << 2) + 2], len);
d.w = __fdividef(des[(tid << 2) + 3], len);
d_des[(idx << 4) + tid] = d;
```

There should be no difference at all as the compiler does this substitution for you anyway. How are you timing your code?


I use `cudaEvent_t start, stop` for timing, as follows:

```cpp
cudaEventRecord(start, 0);
/* kernel under test launches here */
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
```

And if I use `__mul24(tid, 4)` instead of `tid * 4`, is there also no difference?


You might want to look at the assembly generated with decuda.

Three things to consider:

• There is a MAD instruction which can compute a * b + c (the multiplier is 24x24-bit only on SM 1.x devices), but an instruction computing (a >> b) + c only exists on SM 2.0 devices (Fermi).
• SM 1.x devices have special registers and instructions to deal with addresses.
• At the hardware level, addresses are in bytes.

So the compiler has to multiply your addresses by 4 (or shift them left by 2) anyway.
It is likely that in the first case,
des[tid * 4 + 3] = …;
is first turned into a store at byte [des + (tid * 4 + 3) * 4],
then optimized as [des + tid * 16 + 12],
then finally as [des + (tid << 4) + 12].

I suspect what happens is: by using an unusual idiom, you prevent the compiler from inferring what you mean and from applying distributivity to fuse both multiplications/shifts together.
So just stick with multiplication…
Rest assured that if a simple, widely-known optimization can be applied safely and is beneficial in every case, the compiler will use it…


See Table III of "Demystifying GPU Microarchitecture through Microbenchmarking".
The throughput of shl/shr is 7.9 ops/clock with 24 cycles of latency, while integer mul's is 1.7 ops/clock with 96 cycles of latency - there's no way a mul is faster. It must have been optimized by the compiler anyway, as Sylvain said.
(Also note the paradox: floating-point mul's throughput is 11.2, so a float mul actually IS somewhat faster than the bitwise operations.)


It can be faster, but only under the following conditions:

• The compiler can ensure that both operands fit into 24 bits.

• The mul is followed by an add.

• We're running on an SM 1.x device.

Strangely enough, they all seem to apply in the OP's case.

Float mul being faster than 32-bit int mul (on SM 1.x) makes sense: a float multiplication only requires a 24x24-bit multiplier, which is roughly half the size of a 32x32-bit multiplier…


You're right, I totally forgot about mad24.

Here's the benchmark:

[codebox]mov.b32 $r62, %clock
mad24.lo.u32.u16.u16.u32 $r1, $r0.hi, $r0.lo, $r0
add.b32 $r3, $r0, $r0 // kill forwarding
add.b32 $r2, $r1, $r1 // cause RAW stall
mov.b32 $r63, %clock[/codebox]

It executes in 40 cycles, which reveals a true mad24 latency of 40 - 8 - 8 = 24 cycles, believe it or not.

The same is also true of floating-point mul.


If I use "__mul24(a, b) - c" (int a, b, c), can that also be optimized by the compiler?
And are __mul24(a, b) + c and a * b + c equally fast, given that the compiler optimizes them? (int a, b, c)


I don't think there's a good answer to that. The part of the compiler that does this is proprietary, and nobody outside NVIDIA knows exactly how it works. I could make a dummy kernel with that single operation and see if it gets optimized, but that's hardly conclusive - the compiler may decide otherwise in other cases.


Thx


Is "mad24" the multiply-add instruction for integers on CUDA?

There are lots of "a*b+c" (int a, b, c) expressions in my code, but why can't I find "mad24" in its PTX?

My GPU is a G260.

How can I employ the integer multiply-add operation on CUDA to speed up my app?