Parallel use of DP and SP FPUs: can we overlap them as with the SFU?

I was wondering whether it is possible to overlap instructions on the single and double precision FPUs, as it is possible for the SP FPU and the special function unit. At least the latter I remember from a talk by Patrick LeGresley.

I have quite a few DP additions to do where the accuracy of SP emulation would suffice, and it would be a pity to have those SP FPUs sitting around doing nothing. Sadly, in my measurements it looks like I cannot utilize the DP and SP FPUs at the same time, but I still hope someone else has figured out how to do it. Any success stories?

I think it all depends on how many sp vs dp operations you have. The dp operations have basically an 8x higher latency, if I understand correctly, so when you don’t have enough sp operations, your sp units will be waiting on the outcome of the dp operations.

The emulation code is about 8 instructions. But the point is that when I add the SP instructions the program runtime increases, which it shouldn’t if I could hide the SP instructions behind the DP ones.
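For readers unfamiliar with the emulation mentioned above, here is a minimal sketch of "float-float" addition, where a value is kept as the unevaluated sum of two floats. This is not the poster's code: the names FloatPair, two_sum, ff_add, ff_from_double and ff_to_double are hypothetical, the instruction count differs from the figure quoted above, and it is written as plain host C++ rather than device code.

```cpp
#include <cmath>

// Hypothetical "float-float" value: the unevaluated sum hi + lo of two floats.
struct FloatPair {
    float hi;  // leading part
    float lo;  // trailing correction, much smaller than hi
};

// Knuth's two-sum: returns the rounded sum in hi and the rounding error in lo.
static FloatPair two_sum(float a, float b) {
    float s = a + b;
    float bb = s - a;
    float err = (a - (s - bb)) + (b - bb);
    return {s, err};
}

// Add two float-float numbers using single precision operations only.
static FloatPair ff_add(FloatPair x, FloatPair y) {
    FloatPair s = two_sum(x.hi, y.hi);
    float lo = s.lo + (x.lo + y.lo);
    // Renormalize so that hi again carries the leading bits.
    float hi = s.hi + lo;
    lo = lo - (hi - s.hi);
    return {hi, lo};
}

// Helpers for moving between double and the float pair (for checking only).
static FloatPair ff_from_double(double d) {
    float hi = static_cast<float>(d);
    float lo = static_cast<float>(d - static_cast<double>(hi));
    return {hi, lo};
}

static double ff_to_double(FloatPair p) {
    return static_cast<double>(p.hi) + static_cast<double>(p.lo);
}
```

Adding 1.0 and 1e-9 this way keeps the small term in lo, whereas a plain float addition would drop it. The sketch relies on the compiler not reordering floating point operations, so it should not be built with fast-math style options.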

The hardware will not schedule DP and SP instructions at the same time.

Please, clarify a bit …

Is it possible to perform all math in float, but do division (and other non-IEEE-compliant float operations) in double?

Say, like this:

float x = 10.f, y = 20.f;

double dTemp = (double)x / (double)y;

y = (float)dTemp;

Will this actually work as expected (so that y contains the truncated double value, calculated in an IEEE-compliant manner), and how much will this approach slow things down?

Thanks in advance.

Yes, you can mix double and single precision variables.
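As a minimal sketch of that mixing, the division from the question can be wrapped in a small helper. The name div_via_double is hypothetical, and this is plain host C++; how much it costs on the device depends on the hardware's double precision throughput, which this sketch does not measure. Note that the conversion back to float rounds to the nearest float rather than truncating.

```cpp
// Divide two floats by promoting both to double and rounding the quotient
// back to float. div_via_double is a hypothetical helper name.
static float div_via_double(float x, float y) {
    double q = static_cast<double>(x) / static_cast<double>(y);
    return static_cast<float>(q);  // rounds to the nearest float
}
```

With x = 10.f and y = 20.f as in the question, the helper returns 0.5f.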

My reply was to the original question, which was “can they execute at the same time”.

Hmm, when the dp unit is busy doing dp calculations, no sp units are working (on other data). Is that a correct interpretation of what you are saying?

I probably need to convert one or two calculations of a big algorithm to double precision (sinh of large number) as I seem to not be able to rewrite the algorithm to get rid of the intermediate result that overflows. I was hoping it would not make my program get much slower :'(
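To illustrate the kind of overflow described here, the following sketch uses a made-up quantity, the ratio sinh(a) / sinh(b), since the actual algorithm is not shown in the thread. The function names are hypothetical. In float, sinh overflows for arguments of roughly 90 and above, so the ratio is lost even though the final result fits comfortably in a float; doing just the two sinh calls and the division in double avoids that.

```cpp
#include <cmath>

// All-float version: both sinh calls overflow for large arguments,
// so the ratio is not a usable number.
static float sinh_ratio_float(float a, float b) {
    return std::sinh(a) / std::sinh(b);
}

// Mixed version: only the overflowing intermediates are held in double,
// and the final result is converted back to float.
static float sinh_ratio_mixed(float a, float b) {
    double num = std::sinh(static_cast<double>(a));
    double den = std::sinh(static_cast<double>(b));
    return static_cast<float>(num / den);
}
```

For a = 100 and b = 99 the mixed version gives a value close to e, while the all-float version does not produce a finite result.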