FPU and ALU multiplications in parallel: can I take advantage of using both of them?

vonneumann · November 1, 2010, 12:40pm

Hello,
Continuing my arithmetic benchmarking on a Fermi GPU, I noticed that performing a certain number of multiplications using integer arithmetic, let us say 100 000 multiplications, takes a certain time T.

If I execute 100 000 integer multiplications and 100 000 FPU multiplications, the execution time does not double: I get something far less than 2T.

So I was wondering whether, in my applications (which involve high-performance multi-precision arithmetic), I can take advantage of using both the FPU and the ALU to perform multiplications, since it seems that I can exploit a certain level of parallelism.

P.S. I measured that the same does not happen for additions.
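A minimal sketch of the kind of measurement described above, for anyone who wants to reproduce it. This is not the original poster's benchmark: the kernel names `mulIntOnly` and `mulMixed`, the helper `timeMs`, and the launch size (`BLOCKS`, `THREADS`) are all assumptions made for illustration. Each thread runs a dependent chain of integer multiplies, and the mixed kernel adds an independent chain of float multiplies next to it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define ITERS   100000   // multiplications of each kind per thread
#define BLOCKS  64       // assumed launch size, tune for your GPU
#define THREADS 256

typedef void (*Kernel)(unsigned int *, float *, unsigned int, float);

// Integer multiplications only: one dependent chain of ITERS multiplies.
__global__ void mulIntOnly(unsigned int *outI, float *outF,
                           unsigned int mI, float mF)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int a = tid + 1u;
    for (int i = 0; i < ITERS; ++i)
        a *= mI;
    outI[tid] = a;   // store the result so the loop is not optimized away
    outF[tid] = mF;  // same number of stores as the mixed kernel
}

// Same integer chain plus an independent float chain in the same loop.
__global__ void mulMixed(unsigned int *outI, float *outF,
                         unsigned int mI, float mF)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int a = tid + 1u;
    float f = (float)tid + 1.0f;
    for (int i = 0; i < ITERS; ++i) {
        a *= mI;
        f *= mF;
    }
    outI[tid] = a;
    outF[tid] = f;
}

// Time one launch with CUDA events, after an untimed warm-up launch.
// The multipliers are passed at run time so the compiler cannot fold them.
static float timeMs(Kernel k, unsigned int *dI, float *dF)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    k<<<BLOCKS, THREADS>>>(dI, dF, 3u, 1.0f);
    cudaDeviceSynchronize();
    cudaEventRecord(start);
    k<<<BLOCKS, THREADS>>>(dI, dF, 3u, 1.0f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const size_t n = (size_t)BLOCKS * THREADS;
    unsigned int *dI = NULL;
    float *dF = NULL;
    cudaMalloc((void **)&dI, n * sizeof(unsigned int));
    cudaMalloc((void **)&dF, n * sizeof(float));

    float tInt = timeMs(mulIntOnly, dI, dF);
    float tMixed = timeMs(mulMixed, dI, dF);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("int only:    %.3f ms\n", tInt);
    printf("int + float: %.3f ms\n", tMixed);

    cudaFree(dI);
    cudaFree(dF);
    return 0;
}
```

If the second time is close to the first rather than double, the float multiplies are overlapping with the integer ones. It is worth inspecting the generated machine code (for example with `cuobjdump -sass`) to check that the compiler really emitted the multiplies you think you are timing. Replacing `*=` with `+=` in both kernels gives the corresponding test for additions.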

happyjack272 · November 1, 2010, 3:31pm

there are separate integer and fpu pipes.

http://www.pcper.com/article.php?aid=954

i guess the core can fill them both at the same time.

perhaps you don’t get the addition speed gain because addition has higher throughput, so the rate at which the core can issue instructions becomes the limiting factor, i.e. the core can’t issue fast enough to keep both pipes full of additions. but it can do so for multiplications because they have only half the throughput.

this is only an educated guess. i don’t actually know the technical details for sure. but it seems logical.
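The guess above can be written down as a toy model. The sketch below is host-only code and everything in it is hypothetical: the `Rates` struct, the function `modelCycles`, and above all the per-clock rates, which are made-up placeholders chosen so that multiplication has half the throughput of addition. They are not Fermi specifications. The model simply says the run takes as long as the slowest of three stages: instruction issue, the integer pipe, and the floating-point pipe.

```cuda
#include <algorithm>

// Toy throughput model. All rates are hypothetical, not measured values.
struct Rates {
    double issuePerClock;  // instructions the core can issue per clock
    double intPerClock;    // results per clock from the integer pipe
    double fpPerClock;     // results per clock from the floating-point pipe
};

// Cycles needed for nInt integer ops and nFp floating-point ops, assuming
// the two pipes run in parallel and share one issue stage.
double modelCycles(double nInt, double nFp, const Rates &r)
{
    double issue   = (nInt + nFp) / r.issuePerClock;
    double intPipe = nInt / r.intPerClock;
    double fpPipe  = nFp / r.fpPerClock;
    return std::max(issue, std::max(intPipe, fpPipe));
}

// Made-up rates: additions run at the full issue rate in each pipe,
// multiplications at half of it.
static const Rates kAddRates = {32.0, 32.0, 32.0};
static const Rates kMulRates = {32.0, 16.0, 16.0};
```

With these made-up rates, `modelCycles(100000, 0, kMulRates)` and `modelCycles(100000, 100000, kMulRates)` both come out at 6250 cycles, so the float multiplies are free. For additions, `modelCycles(100000, 0, kAddRates)` is 3125 cycles and `modelCycles(100000, 100000, kAddRates)` is 6250, so the time doubles. That matches the shape of the reported measurements, but it only shows the explanation is self-consistent, not that it is what the hardware does.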

vonneumann · November 1, 2010, 3:44pm

Yes, I agree.

I have been thinking that if the multiplications fill the pipelines and there is some stall (for example, a multiplication cannot actually be executed in just one clock cycle), then using both pipelines can lead to some speedup. But again, to quote you, this is a guess.

Cygnus_X1 · November 4, 2010, 4:02pm

Throughput has changed drastically between compute capability 1.3 and 2.0 devices:
