# Arithmetic Operations benchmarking with CUDA FERMI Understanding pure performance of arithmetic on F

Hello everybody,
I have several questions about arithmetic performance. I want to measure the raw arithmetic performance of FERMI.

1. Is it possible on FERMI GPUs to use the FPU and the integer ALU in parallel?
2. If I have to perform a large number of additions and/or multiplications, what speed-up do I get from using both units (can I ACTUALLY execute two operations in parallel)?
3. How much slower is 32-bit multiplication than 32-bit addition on the integer ALU?
4. How much slower is FPU multiplication than FPU addition?
5. How much faster are integer arithmetic operations than the same operations on the FPU?

I tried writing benchmarks, but it seems that integer multiplication is just a bit slower than addition.
What is the best way to perform such measurements?
I used a single thread with a for loop that contains 128 operations, in which each addition (or multiplication) depends on the result of the previous one.
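
The measurement described above could look roughly like the sketch below. The kernel and helper names are hypothetical (they do not come from the original post), and the two variables feed each other only so that the compiler cannot fold 128 additions into a single one. The reported figure also includes loop and timer overhead, so it is worth checking the generated SASS (for example with `cuobjdump -sass`) before trusting it.

```cuda
#include <cuda_runtime.h>

// Sketch only: kernel and helper names are hypothetical, not from the thread.
// One thread runs 128 dependent integer operations per pass of the outer loop.
template <bool kMultiply>
__global__ void dependent_chain(unsigned a, unsigned b, int passes,
                                long long *cycles, unsigned *sink) {
    unsigned x = a, y = b;
    long long start = clock64();
    for (int p = 0; p < passes; ++p) {
#pragma unroll
        for (int j = 0; j < 64; ++j) {
            // Each statement needs the result of the one before it, and the
            // x/y cross-feeding keeps the compiler from collapsing the chain.
            if (kMultiply) { x = x * y; y = y * x; }
            else           { x = x + y; y = y + x; }
        }
    }
    long long stop = clock64();
    *cycles = stop - start;
    *sink = x ^ y;  // keep the results live so the loop is not removed
}

// Returns the average SM clock cycles per operation (loop and timer overhead
// included), or -1.0 if a CUDA call fails.
double chain_cycles_per_op(bool multiply, int passes) {
    long long *d_cycles = NULL;
    unsigned *d_sink = NULL;
    if (cudaMalloc(&d_cycles, sizeof(long long)) != cudaSuccess) return -1.0;
    if (cudaMalloc(&d_sink, sizeof(unsigned)) != cudaSuccess) {
        cudaFree(d_cycles);
        return -1.0;
    }
    // Odd seeds keep the products odd, so the multiply chain never hits zero.
    if (multiply) dependent_chain<true><<<1, 1>>>(3u, 5u, passes, d_cycles, d_sink);
    else          dependent_chain<false><<<1, 1>>>(3u, 5u, passes, d_cycles, d_sink);
    long long cycles = 0;
    // cudaMemcpy waits for the kernel and reports any launch error.
    cudaError_t err = cudaMemcpy(&cycles, d_cycles, sizeof(cycles),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_cycles);
    cudaFree(d_sink);
    if (err != cudaSuccess) return -1.0;
    return (double)cycles / (128.0 * passes);
}
```

Calling `chain_cycles_per_op(false, 1000)` and `chain_cycles_per_op(true, 1000)` from host code gives two numbers that can be compared directly. Because only one thread is running, they reflect instruction latency rather than throughput.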

1.) No. Each ALU executes either a float or an int instruction at a time (unlike a CPU).

2.) The hardware executes FMA (fused multiply-add) instructions. Make sure your code compiles to those whenever possible. A separate addition and multiplication cannot be executed in parallel; together they take twice as long as one FMA (again unlike a CPU).

3.) Some int instructions can only execute on half of the ALUs (MAD and SAD for sure). I am not sure about MUL. Maybe it can execute on all of them, but with higher latency than ADD.

4.) Both have the same speed (think of an FMA instruction with the MUL part wasted for an ADD, or the ADD part wasted for a MUL).

5.) int and float usually run at the same speed. Exception: integer MAD can execute on only half of the ALUs and has higher latency, whereas FMA can execute on all ALUs.

DIV is very expensive.

> I used a single thread with a for loop that contains 128 operations, in which each addition (or multiplication) depends on the result of the previous one.

=> In this case you are latency bound.

=> Measure throughput instead: use operations that do not depend on previous results.
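
A throughput-oriented version of the benchmark might look like the sketch below. It is only an illustration under assumptions: the names, the choice of four accumulators per thread, and the launch sizes are all hypothetical. Each thread keeps several independent accumulators and many threads run at once, so no addition has to wait for an unrelated one.

```cuda
#include <cuda_runtime.h>

// Sketch only: names and sizes are illustrative assumptions.
// Every thread advances four independent accumulators, so consecutive
// additions do not have to wait for each other.
__global__ void independent_adds(float seed, float inc, int passes, float *out) {
    float a0 = seed, a1 = seed + 1.0f, a2 = seed + 2.0f, a3 = seed + 3.0f;
    for (int p = 0; p < passes; ++p) {
#pragma unroll
        for (int j = 0; j < 32; ++j) {
            a0 += inc; a1 += inc; a2 += inc; a3 += inc;
        }
    }
    // Store the result so the compiler cannot discard the arithmetic.
    out[blockIdx.x * blockDim.x + threadIdx.x] = (a0 + a1) + (a2 + a3);
}

// Returns billions of additions per second, or -1.0 if a CUDA call fails.
double add_throughput_gops(int blocks, int threads, int passes) {
    float *d_out = NULL;
    size_t n = (size_t)blocks * threads;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) return -1.0;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch, so one-time start-up costs are not timed.
    independent_adds<<<blocks, threads>>>(1.0f, 0.5f, passes, d_out);

    cudaEventRecord(start);
    independent_adds<<<blocks, threads>>>(1.0f, 0.5f, passes, d_out);
    cudaEventRecord(stop);

    cudaError_t err = cudaEventSynchronize(stop);
    float ms = 0.0f;
    if (err == cudaSuccess) err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    if (err != cudaSuccess || ms <= 0.0f) return -1.0;

    // 32 unrolled steps times 4 accumulators per pass, per thread.
    double adds = (double)n * passes * 32.0 * 4.0;
    return adds / (ms * 1.0e6);  // ms -> s and ops -> Gops combined
}
```

A call such as `add_throughput_gops(64, 256, 10000)` keeps the multiprocessors busy with many warps. Swapping the `+=` for a multiplication or an integer type gives the other cases from the question, provided the compiled code still contains the intended instructions.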

1. So you mean that on one core I can use either the FPU or the ALU at a given time, but not both? So if I have to execute 1000 additions, it is useless to schedule 500 on the ALU and 500 on the FPU (no speed-up)?

2-3) I am not interested in executing FMA; I just want to compare multiplication latency with addition latency to understand the difference.

You mean that multiplications are executed as part of an FMA, and the same holds for additions?

There’s a summary of instruction throughputs in the SDK; I believe it’s in the programming guide. To convert a throughput to clock cycles, just divide 32 (the number of threads in a warp) by the throughput.

I believe they use “throughput” because it’s technically more accurate, since latency is masked by temporal multithreading (a.k.a. thread scheduling).
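
As a small worked illustration of that conversion (the function name is hypothetical, and the throughput values mentioned afterwards are example inputs rather than figures quoted from the guide):

```cuda
#include <cuda_runtime.h>

// Sketch only: converts a throughput figure (operations per clock cycle per
// multiprocessor) into the clock cycles needed to issue one instruction for
// a full warp of 32 threads.
__host__ __device__ inline double cycles_per_warp_instruction(double ops_per_clock) {
    const double kWarpSize = 32.0;  // threads in a warp
    return kWarpSize / ops_per_clock;
}
```

With this, a throughput of 32 corresponds to 1 cycle per warp instruction, 16 to 2 cycles, and 8 to 4 cycles. The actual throughput for each instruction should be taken from the table in the programming guide for the GPU in question.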

1.) Correct. Each ALU executes either one int or one float instruction per cycle (there are no two dedicated FPU/int units like on a CPU).

2-3.) Correct. AFAIK there is no performance gain (in throughput or latency) from using “MUL x,y,z” instead of “FMA x,y,z,0.0f” (or “ADD x,y,z” instead of “FMA x,1.0f,y,z”), at least on GF100 (GTX470/480).

(On GF104 (GTX460) the situation is more complicated because of superscalar execution. I read on some forum (beyond3d?) that there is not enough register file bandwidth to supply all 9 input operands for 3 concurrent FMA instructions. In that situation, 2 FMA + 1 MUL, for example, would be faster.)

Of course, as happyjack272 said, latency is normally not important on a GPU when there are enough active threads to execute.
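
To check the MUL/ADD/FMA claim on a particular card, one option is to time the same dependent chain with each of the three operations. The sketch below assumes hypothetical names (`Op`, `step`, `float_chain`, `float_chain_cycles_per_op`). It relies on the `__fmul_rn` and `__fadd_rn` intrinsics, which the compiler does not merge into an FMA, and on `fmaf` for the fused form.

```cuda
#include <cuda_runtime.h>

// Sketch only: all names here are hypothetical.
enum Op { kMul, kAdd, kFma };

template <Op op>
__device__ __forceinline__ float step(float x, float y, float z) {
    if (op == kMul) return __fmul_rn(x, y);  // plain multiply, never fused
    if (op == kAdd) return __fadd_rn(x, z);  // plain add, never fused
    return fmaf(x, y, z);                    // fused multiply-add
}

// One thread runs 128 dependent operations per pass and reports the cycles.
template <Op op>
__global__ void float_chain(float x, float y, float z, int passes,
                            long long *cycles, float *sink) {
    long long start = clock64();
    for (int p = 0; p < passes; ++p) {
#pragma unroll
        for (int j = 0; j < 128; ++j) {
            x = step<op>(x, y, z);  // each step waits for the previous result
        }
    }
    long long stop = clock64();
    *cycles = stop - start;
    *sink = x;  // keep the result live
}

// Average SM clock cycles per operation (loop and timer overhead included),
// or -1.0 if a CUDA call fails.
template <Op op>
double float_chain_cycles_per_op(int passes) {
    long long *d_cycles = NULL;
    float *d_sink = NULL;
    if (cudaMalloc(&d_cycles, sizeof(long long)) != cudaSuccess) return -1.0;
    if (cudaMalloc(&d_sink, sizeof(float)) != cudaSuccess) {
        cudaFree(d_cycles);
        return -1.0;
    }
    // y = 1 and z = 0 keep x constant, so no overflow or denormals appear.
    float_chain<op><<<1, 1>>>(1.5f, 1.0f, 0.0f, passes, d_cycles, d_sink);
    long long cycles = 0;
    cudaError_t err = cudaMemcpy(&cycles, d_cycles, sizeof(cycles),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_cycles);
    cudaFree(d_sink);
    if (err != cudaSuccess) return -1.0;
    return (double)cycles / (128.0 * passes);
}
```

Comparing `float_chain_cycles_per_op<kMul>(1000)`, `float_chain_cycles_per_op<kAdd>(1000)` and `float_chain_cycles_per_op<kFma>(1000)` shows whether the three latencies really match on the hardware being tested.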
