On GK110 in single precision, I can reach roughly 4 TFLOP/s or slightly more in artificial benchmark code where three of the four register operands of FFMA are identical (a = a*b+a).
With two identical registers (c = a*b+c), I get about 3 TFLOP/s.
In real-world code, ptxas seems to shuffle data between registers arbitrarily, so I end up with four different registers (d = a*b+c) and only reach about 2 TFLOP/s.
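For concreteness, the three operand patterns can be sketched at the CUDA source level as follows. This is a hypothetical sketch, not the benchmark behind the numbers above: the kernel name `ffma_pattern`, the helper `run`, the iteration count, and the launch configuration are all illustrative assumptions. Each thread runs a single dependent chain, so the code is not tuned to reach peak throughput. Source-level variables also do not dictate the register assignment ptxas finally picks, so the SASS still has to be inspected to see which FFMA form was actually generated.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

const int ITERS = 1 << 14;        // loop trips per thread (illustrative value)
const int FMAS_PER_ITER = 4;      // every pattern issues 4 FMAs per trip

// PATTERN 1: a = a*b + a   (destination equals two of the sources)
// PATTERN 2: c = a*b + c   (destination equals the addend)
// PATTERN 3: d = a*b + c   (four distinct variables, rotated each step)
template <int PATTERN>
__global__ void ffma_pattern(float *out, float seed) {
    // Operands depend on runtime values so nothing can be constant-folded.
    float a = seed + threadIdx.x;
    float b = seed * 0.5f;
    float c = seed - blockIdx.x;
    float d = 0.0f;
    #pragma unroll 8
    for (int i = 0; i < ITERS; ++i) {
        if (PATTERN == 1) {
            a = fmaf(a, b, a); a = fmaf(a, b, a);
            a = fmaf(a, b, a); a = fmaf(a, b, a);
        } else if (PATTERN == 2) {
            c = fmaf(a, b, c); c = fmaf(a, b, c);
            c = fmaf(a, b, c); c = fmaf(a, b, c);
        } else {
            d = fmaf(a, b, c); a = fmaf(b, c, d);
            b = fmaf(c, d, a); c = fmaf(d, a, b);
        }
    }
    // Store a value that depends on every variable so no chain is eliminated.
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b + c + d;
}

// Times one launch with CUDA events and returns the rate in GFLOP/s.
template <int PATTERN>
double run(float *d_out, int blocks, int threads) {
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    ffma_pattern<PATTERN><<<blocks, threads>>>(d_out, 1.0f);   // warm-up
    cudaEventRecord(t0);
    ffma_pattern<PATTERN><<<blocks, threads>>>(d_out, 1.0f);   // timed launch
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    // One FMA counts as two floating-point operations.
    double flops = 2.0 * FMAS_PER_ITER * ITERS * (double)blocks * threads;
    return flops / (ms * 1e-3) / 1e9;
}

int main() {
    const int blocks = 1024, threads = 256;   // illustrative launch shape
    float *d_out = NULL;
    cudaMalloc(&d_out, sizeof(float) * blocks * threads);
    printf("a = a*b+a: %.0f GFLOP/s\n", run<1>(d_out, blocks, threads));
    printf("c = a*b+c: %.0f GFLOP/s\n", run<2>(d_out, blocks, threads));
    printf("d = a*b+c: %.0f GFLOP/s\n", run<3>(d_out, blocks, threads));
    cudaFree(d_out);
    return 0;
}
```

Assuming the file is saved as `bench.cu` (a made-up name), it can be built with `nvcc -arch=sm_35 -O3 bench.cu -o bench`, and `cuobjdump -sass bench` then shows which registers each FFMA actually uses.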
I remember that NVIDIA GPUs have suffered from register bandwidth starvation before, so I wonder whether that is the case again (although I don’t really see where the difference between 3-address FFMA, c = a*b+c, and 4-address FFMA, d = a*b+c, would come from).
Unfortunately, I don’t know how to force ptxas to generate either 3-address or 4-address FFMA, so I can’t test the hypothesis without going through the pain of patching object binaries.
Has anyone observed > 2 TFLOP/s on 4-address FFMA? Alternatively, does anyone know how to tell ptxas to keep data in the same register?