Why is shift slower than integer multiply?

We know that an integer multiply by 2^n can be replaced by a shift, which is supposed to be faster.

As versions 1 and 2 below show, I replaced tid * 4 with tid << 2, expecting a speedup. But the actual result shows that version 1 is faster than version 2, and I have tested this many times.

I am confused: why is the shift slower than the integer multiply? My GPU is a G260.

uint tid

version 1

[codebox]des[tid * 4 + 0] = …;
des[tid * 4 + 1] = …;
des[tid * 4 + 2] = …;
des[tid * 4 + 3] = …;

__syncthreads();

float4 d;
d.x = __fdividef(des[tid * 4 + 0], len[0]);
d.y = __fdividef(des[tid * 4 + 1], len[0]);
d.z = __fdividef(des[tid * 4 + 2], len[0]);
d.w = __fdividef(des[tid * 4 + 3], len[0]);
d_des[idx * 16 + tid] = d;[/codebox]

version 2

[codebox]des[(tid<<2)] = …;
des[(tid<<2) + 1] = …;
des[(tid<<2) + 2] = …;
des[(tid<<2) + 3] = …;

__syncthreads();

float4 d;
d.x = __fdividef(des[(tid<<2)], len[0]);
d.y = __fdividef(des[(tid<<2) + 1], len[0]);
d.z = __fdividef(des[(tid<<2) + 2], len[0]);
d.w = __fdividef(des[(tid<<2) + 3], len[0]);
d_des[(idx<<4) + tid] = d;[/codebox]

There should be no difference at all, as the compiler does this substitution for you anyway. How are you timing your code?
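
One quick way to check is to compile both idioms and compare the generated PTX (a minimal sketch; the kernel names and the stored value are placeholders):

[codebox]// Hypothetical test kernels: compile with "nvcc -ptx check.cu" and
// compare the address arithmetic emitted for the two stores.
__global__ void mul_version(float *des)
{
    unsigned int tid = threadIdx.x;
    des[tid * 4 + 3] = 1.0f;     // multiply form
}

__global__ void shift_version(float *des)
{
    unsigned int tid = threadIdx.x;
    des[(tid << 2) + 3] = 1.0f;  // shift form
}[/codebox]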


I use “cudaEvent_t start, stop” for timing, as follows:

[codebox]cudaEventRecord(start, 0);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);[/codebox]

And if I use __mul24(tid, 4) instead of tid * 4, is there also no difference?
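
For reference, a complete version of that timing skeleton, with event creation, the kernel launch being timed, and cleanup, would look like this (a sketch; myKernel, grid, and block are placeholders):

[codebox]cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(/* args */);     // kernel under test (placeholder)
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                // wait until the kernel has finished

float time;
cudaEventElapsedTime(&time, start, stop);  // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);[/codebox]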


You might want to look at the assembly generated with decuda.

Three things to consider:

  • There is a MAD instruction which can compute a * b + c (the multiply is 24x24-bit only on SM 1.x devices), but an instruction that computes (a << b) + c only exists on SM 2.0 devices (Fermi).
  • SM 1.x devices have special registers and instructions to deal with addresses.
  • At the hardware level, addresses are in bytes.

So the compiler has to multiply your addresses by 4 (or shift them left by 2) anyway.
It is likely that in the first case,
des[tid * 4 + 3] = …;
is first turned into a store at byte address [des + (tid * 4 + 3) * 4],
then optimized to [des + tid * 16 + 12],
and finally to [des + (tid << 4) + 12].

I suspect what happens is: by using an unusual idiom, you prevent the compiler from inferring what you mean and from applying distributivity to fuse both multiplications/shifts together.
So just stick with multiplication…
You can be sure that if a simple, widely-known optimization can be applied safely and is beneficial in every case, the compiler will use it…
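
In C terms, the folding described above corresponds to these equivalent forms (a sketch of the expected transformation, not actual compiler output; fold_demo is a hypothetical kernel):

[codebox]// All four statements store x to the same byte address.
__global__ void fold_demo(float *des, float x)
{
    unsigned int tid = threadIdx.x;
    des[tid * 4 + 3] = x;                             // source form
    *(float *)((char *)des + (tid * 4 + 3) * 4) = x;  // explicit byte addressing
    *(float *)((char *)des + tid * 16 + 12) = x;      // distribute the * 4
    *(float *)((char *)des + (tid << 4) + 12) = x;    // strength-reduce to a shift
}[/codebox]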


See Table III of Demystifying GPU Microarchitecture through Microbenchmarking.
The throughput of shl/shr is 7.9 ops/clock with 24 cycles of latency, while integer mul’s is 1.7 ops/clock with 96 cycles of latency - there’s no way a mul is faster. It must have been optimized by the compiler anyway, as Sylvain said.
(Also note the paradox: floating-point mul’s throughput is 11.2 ops/clock, so float mul IS somewhat faster than the bitwise operations.)
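
If you want to reproduce such numbers yourself, the usual approach is to time a dependent chain with the clock register (a rough sketch; the kernel name and iteration count are arbitrary, and clock()-based timing is only approximate since the compiler may reorder instructions):

[codebox]// Each iteration depends on the previous result, so the loop time is
// dominated by the latency of the operation under test.
__global__ void mul_latency(unsigned int *out, unsigned int seed)
{
    unsigned int x = seed;
    unsigned int t0 = (unsigned int)clock();
    #pragma unroll
    for (int i = 0; i < 256; ++i)
        x = x * 3 + 1;           // swap in (x << 3) + 1 to test the shift
    unsigned int t1 = (unsigned int)clock();
    out[0] = t1 - t0;            // cycles for 256 dependent operations
    out[1] = x;                  // keep x live so the loop isn't optimized away
}[/codebox]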


It can be faster, but only under the following conditions:

  • The compiler can ensure that both operands fit into 24 bits.

  • The mul is followed by an add.

  • We’re running on an SM 1.x device.

Strangely enough, they all seem to apply in the OP’s case.

Float mul being faster than 32-bit int mul (on SM 1.x) makes sense: a float multiplication only requires a 24x24-bit multiplier, which is roughly half the size of a 32x32-bit multiplier…
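
Applied to the OP’s stores, the mul24-followed-by-add pattern described above would look like this (a sketch; valid only while tid fits in 24 bits):

[codebox]des[__umul24(tid, 4) + 3] = …;  // __umul24 plus the add can map to one mad24 on SM 1.x[/codebox]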


You’re right, I totally forgot about mad24.

Here’s the benchmark:

[codebox]mov.b32 $r62, %clock
mad24.lo.u32.u16.u16.u32 $r1, $r0.hi, $r0.lo, $r0
add.b32 $r3, $r0, $r0  // kill forwarding
add.b32 $r2, $r1, $1   // cause RAW stall
mov.b32 $r63, %clock[/codebox]

It executes in 40 cycles, which reveals the true mad24 latency: 40 - 8 - 8 = 24 cycles, believe it or not.

The same holds for floating-point mul.


If I use “__mul24(a, b) - c”, can it also be optimized by the compiler? (int a, b, c)
And are “__mul24(a, b) + c” and “a * b + c” equally fast, given that the compiler optimizes them? (int a, b, c)


I don’t think there’s a good answer to that. The part of the compiler that does this is proprietary, and nobody outside NVIDIA knows exactly how it works. I could make a dummy kernel with that single operation and see if it gets optimized, but that would hardly be conclusive - the compiler may decide otherwise in other cases.
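
Such a dummy kernel might look like this (hypothetical; compile with nvcc -ptx, or disassemble the cubin with decuda, and check whether a single multiply-add shows up):

[codebox]__global__ void dummy(int *out, int a, int b, int c)
{
    out[threadIdx.x] = __mul24(a, b) - c;  // does this become one multiply-add?
}[/codebox]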


Thx


Is “mad24” the integer multiply-add instruction on CUDA?

There are lots of “a*b+c” (int a, b, c) expressions in my code, but why can’t I find “mad24” in its PTX?

My GPU is a G260.

How can I employ the integer multiply-add operation on CUDA to speed up my app?