Greetings.
I am on the final leg of optimisation for the correlation kernel I am working on. On a GTX 480 I am currently sitting at 950 Gflop/s, which is great performance, but I’m still hungry for more :rolleyes:
This computation uses complex-valued linear algebra, and I had assumed that Fermi would support a fused multiply-subtract operation in addition to fused multiply-add. The former is needed for efficient evaluation of complex multiplication, which requires operations like “x -= a*y”. Fused multiply-subtract is supported on PowerPC and will be supported in Intel’s forthcoming FMA extensions to AVX. However, when I looked at the PTX generated from my kernel, I was surprised to find that operations of this form are evaluated in two stages, i.e., a mul followed by a sub. I checked the PTX manual and confirmed the lack of an fms instruction.
For example, code like this, which appears in a loop,
[codebox] sum11XXreal += row1Xreal * col1Xreal;
sum11XXreal += row1Ximag * col1Ximag;
sum11XXimag += row1Ximag * col1Xreal;
sum11XXimag -= row1Xreal * col1Ximag;[/codebox]
is transformed into this:
[codebox] fma.rn.ftz.f32 %f51, %f35, %f43, %f34;
fma.rn.ftz.f32 %f52, %f36, %f44, %f51;
fma.rn.ftz.f32 %f53, %f35, %f44, %f33;
mul.ftz.f32 %f54, %f36, %f43;
sub.ftz.f32 %f55, %f53, %f54;[/codebox]
So I guess the conclusion is that fms isn’t supported on Fermi? I had hoped to exceed 1 Tflop/s in my kernel, but I’m currently stumped as to where to wring out the last remaining performance. Since a mul and a sub each cost the same as a single fma, each iteration issues five instructions where four would do, which implies that performance would be 950 * 5/4 ≈ 1188 Gflop/s if the fms instruction were supported. That is about as fast as I could hope for given integer indexing and shared memory latencies.
Hopefully fms will be supported in future architectures, since the overhead for its addition to the instruction set would be extremely small.
One solution I’m working on is to split the imag accumulator into two components and only combine them at the end of the loop (where the relative negative sign can be applied). This will lead to a 50% increase in accumulation registers though, so I’m extremely doubtful it will improve performance.