nofma option (PGIAcc)

When I compile my PGI Accelerator application with the PGI 11.10 compiler instead of the older 10.9 version, my program gets slower: it used to take 22 seconds and now takes 32 seconds! I figured out that I can get the “old” performance back by passing the option “nofma” to the current compiler. But I do not understand why, since FMA should actually be faster than separate mul and add instructions. Furthermore, it is very weird that the number of mul instructions in the PTX code decreases (!) when I use the nofma option. I would have expected it to increase (as the number of add instructions does). Can it be that the current version somehow interprets non-mul instructions as mul instructions and therefore reports a high number of mul instructions? And then, when I add “nofma”, the “interpretation” is correct again?
Do you understand what I mean?

Any idea?

The only difference for ‘nofma’ is that the compiler generates __fmul_rn() or __dmul_rn() calls instead of using ‘*’ for floating point multiply. Otherwise, the generated code is exactly the same. This is a real puzzle.
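To make the two code shapes concrete, here is a minimal hand-written CUDA sketch. It is not the actual output of the PGI compiler; the kernel names and the host helper `run_both` are made up for illustration. The one documented fact it relies on is that a `__fmul_rn()` result is never merged into a fused multiply-add, whereas a plain `a * x + y` may be.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Plain '*' and '+': the compiler is allowed to contract this into one FMA.
__global__ void madd_default(float a, float x, float y, float *out) {
    *out = a * x + y;
}

// 'nofma'-style form: __fmul_rn() is never merged into an FMA, so the
// product is rounded to float before the add is performed.
__global__ void madd_nofma(float a, float x, float y, float *out) {
    *out = __fmul_rn(a, x) + y;
}

// Hypothetical host helper: runs both kernels on the same inputs.
void run_both(float a, float x, float y, float *r_default, float *r_nofma) {
    float *d_out;
    cudaMalloc(&d_out, 2 * sizeof(float));
    madd_default<<<1, 1>>>(a, x, y, d_out);
    madd_nofma<<<1, 1>>>(a, x, y, d_out + 1);
    float h[2];
    cudaMemcpy(h, d_out, sizeof(h), cudaMemcpyDeviceToHost);  // waits for the kernels
    cudaFree(d_out);
    *r_default = h[0];
    *r_nofma = h[1];
}

int main() {
    // (1 + 2^-13) * (1 - 2^-13) = 1 - 2^-26, which rounds to 1.0f as a float.
    float a = 1.0f + ldexpf(1.0f, -13);
    float x = 1.0f - ldexpf(1.0f, -13);
    float r_default, r_nofma;
    run_both(a, x, -1.0f, &r_default, &r_nofma);
    printf("default: %g\nnofma:   %g\n", r_default, r_nofma);
    return 0;
}
```

With these inputs the ‘nofma’-style kernel returns exactly 0, because the product is rounded to 1.0f before the add. The default kernel returns a tiny nonzero value only if the compiler actually emits an FMA for it.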

Can you compile both ways, with ‘keepgpu’ and with ‘keepgpu,nofma’, and send us the .gpu files? I’d like to see what is happening here.

A thought occurred to me … do you have multiply-by-constant? I think the eventual code generation will turn ‘a*2’ into ‘a+a’, whereas ‘__fmul_rn(a,2)’ is left as a multiply.
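A small sketch of the two forms I have in mind, written by hand in CUDA (the kernel names and the helper `run_twice` are invented for this example). Whether the back end really rewrites the first one as an add is only my guess; compiling with `nvcc -ptx` and comparing the mul and add counts for each kernel would confirm or refute it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Plain multiply by a constant: a candidate for being rewritten as a + a.
__global__ void twice_plain(float a, float *out) {
    *out = a * 2.0f;
}

// Intrinsic multiply, as generated under 'nofma': expected to stay a multiply.
__global__ void twice_intrinsic(float a, float *out) {
    *out = __fmul_rn(a, 2.0f);
}

// Hypothetical host helper: runs both kernels and copies the results back.
void run_twice(float a, float *r_plain, float *r_intrinsic) {
    float *d_out;
    cudaMalloc(&d_out, 2 * sizeof(float));
    twice_plain<<<1, 1>>>(a, d_out);
    twice_intrinsic<<<1, 1>>>(a, d_out + 1);
    float h[2];
    cudaMemcpy(h, d_out, sizeof(h), cudaMemcpyDeviceToHost);  // waits for the kernels
    cudaFree(d_out);
    *r_plain = h[0];
    *r_intrinsic = h[1];
}

int main() {
    float r_plain, r_intrinsic;
    run_twice(1.5f, &r_plain, &r_intrinsic);
    printf("plain: %g, intrinsic: %g\n", r_plain, r_intrinsic);
    return 0;
}
```

Both kernels compute the same value, since doubling a float is exact; only the instruction mix in the generated code could differ.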

Dear Michael,
I have a couple of INTEGER multiplies-by-constant (e.g. 4*i+3 with i integer), but almost no floating point multiplies-by-constant :-( I only have divisions: 1.0 / f with f floating point…

Hi Sandra,

Can you send PGI Customer Service a reproducing example?