FPU and ALU multiplications in parallel: can I take advantage of using both of them?

Hello,
Continuing my arithmetic benchmarking on a Fermi GPU, I noticed that performing a certain number of multiplications using integer arithmetic, let us say 100 000 multiplications, takes a certain time T.

If I execute 100 000 integer multiplications and 100 000 FPU multiplications, the execution time does not double: I get something far less than 2T.

So I was wondering whether I can take advantage of this in my applications (which involve high-performance multi-precision arithmetic) by using both the FPU and the ALU to perform multiplications, since it seems that I can exploit a certain level of parallelism.

P.S. I measured that the same does not happen for additions.

There are separate integer and FPU pipes.

I guess the core can fill them both at the same time.

Perhaps you don’t get the speed gain for additions because addition has higher throughput, so how fast the core can fill the pipes becomes the limiting factor. That is, the core can’t issue instructions fast enough to fill both pipes with additions, but it can do so for multiplications because they have only half the throughput.

This is only an educated guess. I don’t actually know the technical details for sure, but it seems logical.

Yes, I agree.

I have been thinking that if the multiplications fill the pipelines and there is some stall (for example, a multiplication cannot actually be executed in just one clock cycle), then I guess that using both pipes can lead to some speedup. But, as you said, this is only a guess.

Throughput has changed drastically between compute capability 1.3 and 2.0 devices:
