Fused-multiply-subtract: no support in Fermi?

Greetings.

I am on the final leg of optimisation for the correlation kernel I am working on. On a GTX 480 I am currently sitting at 950 Gflop/s, which is great performance, but I’m still hungry for more :rolleyes:

This computation uses complex-valued linear algebra, and I had assumed that Fermi would support a fused-multiply-subtract operation in addition to fused-multiply-add. The former is required for efficient evaluation of complex multiplication, which involves operations like “x -= a*y”. Fused-multiply-subtract is supported on PowerPC and will be supported in Intel’s forthcoming FMA extensions to AVX. However, when I looked at the PTX generated from my kernel, I was surprised to find that operations of this form require two-stage evaluation, i.e., a mul followed by a sub. I checked the PTX manual, and confirmed the lack of an fms instruction.

i.e. I have code like this, which appears in a loop

[codebox] sum11XXreal += row1Xreal * col1Xreal;

sum11XXreal += row1Ximag * col1Ximag;

sum11XXimag += row1Ximag * col1Xreal;

sum11XXimag -= row1Xreal * col1Ximag;[/codebox]

and it is transformed into this

[codebox] fma.rn.ftz.f32 %f51, %f35, %f43, %f34;

fma.rn.ftz.f32 	%f52, %f36, %f44, %f51;

fma.rn.ftz.f32 	%f53, %f35, %f44, %f33;

mul.ftz.f32 	%f54, %f36, %f43;

sub.ftz.f32 	%f55, %f53, %f54;[/codebox]

So I guess the conclusion is that fms isn’t supported in Fermi? I had hoped to exceed 1 Tflop/s with this kernel, but I’m currently stumped as to where to wring out the last remaining performance. Since a mul and a sub each cost the same as a single fma, the complex multiply-accumulate above takes five instructions where four would suffice, which implies performance would be 950 * 5/4 = 1188 Gflop/s if the fms instruction were supported. That is about as fast as I could hope for given the integer indexing and shared memory latencies.

Hopefully fms will be supported in future architectures, since the overhead of adding it to the instruction set would be extremely small.

One workaround I’m looking at is to split the imaginary accumulator into two components, one gathering the positive terms and one the negative terms, and only combine them at the end of the loop (where the relative minus sign can be applied). This leads to a 50% increase in accumulation registers though, so I’m extremely doubtful it will improve performance.
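For concreteness, a minimal sketch of that idea; the _pos/_neg accumulator names are placeholders modelled on the snippet above rather than my actual code:

[codebox]// Inside the loop: every statement is now a pure multiply-add,
// so each one can map onto a single fma.
sum11XXreal     += row1Xreal * col1Xreal;
sum11XXreal     += row1Ximag * col1Ximag;
sum11XXimag_pos += row1Ximag * col1Xreal;
sum11XXimag_neg += row1Xreal * col1Ximag;

// After the loop: fold the two halves together once,
// applying the relative minus sign here.
sum11XXimag = sum11XXimag_pos - sum11XXimag_neg;[/codebox]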

Try disassembling your .cubin with nv50dis. I vaguely remember that compute capability 1.x devices support multiply-subtract. So maybe you are already using it without even knowing about it.

I know it is frustrating when you just want to squeeze out a few more % of speed and realize your great optimization idea is already included in the baseline results. :)

Excellent suggestion. Thanks.

Ok, it would appear you are correct; the assembly looks as follows (not a one-to-one match with the code above)

[codebox]00000338: 303c0000cae59c40 add ftz rn f32 $r22 mul $r46 $r50 $r30

00000340: 30380000cb051c00 add rn f32 $r20 mul $r48 $r50 $r28

00000348: 306e0000caa79c40 add ftz rn f32 $r30 mul $r42 $r50 $r55

00000350: 303a0000c7055e00 add rn f32 $r21 neg mul $r48 $r49 $r29

[/codebox]

The “add ftz rn f32 $r1 mul $r2 $r3 $r4” and “add rn f32 $r1 mul $r2 $r3 $r4” forms are fma instructions, and “add ftz rn f32 $r1 neg mul $r2 $r3 $r4” and “add rn f32 $r1 neg mul $r2 $r3 $r4” are fms instructions. This leads to two questions:

  1. Why isn’t fms exposed at the ptx level?

  2. Why are there two different variants of the fma and fms instructions, with and without ftz? I am compiling with the ftz=true flag.

Anyway, looks like I’ll have to keep hunting for those optimizations to get me over the Tflop :)

960 Gflops… Wow! Can you share some tips?

PTX is an intermediate language optimized for “compiler-friendliness”, not compactness. Assuming that fms(a,b,c) is functionally equivalent to fma(a,b,-c), there is no need to add any extraneous instruction.

I guess having a smaller set of instructions makes high-level optimizations easier, and any decent peephole optimizer will detect all fms eventually.
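If you want the fused form to be explicit already at the PTX level, rather than relying on the ptxas peephole, one thing to try is spelling it out with the __fmaf_rn() intrinsic and a negated operand, sketched below using the variable names from your snippet. Whether this changes the generated machine code at all is something you would have to verify with nv50dis:

[codebox]// Speculative rewrite of the imaginary-part update: the subtraction is
// expressed as a fused multiply-add with one operand negated.
// __fmaf_rn(a, b, c) computes a*b + c with a single rounding.
sum11XXimag = __fmaf_rn(row1Ximag, col1Xreal, sum11XXimag);
sum11XXimag = __fmaf_rn(-row1Xreal, col1Ximag, sum11XXimag);[/codebox]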

Good question. :)

ftz is bit 6 (counting from the right) in the opcode, and the neg flags are bits 8 and 9, so they should not conflict. Unless there is a subtle non-orthogonal encoding peculiarity that nv50dis does not catch?..

Would your GPU burn if you use a hex editor to turn the opcodes into this?

[codebox]00000338: 303c0000cae59c40 add ftz rn f32 $r22 mul $r46 $r50 $r30

00000340: 30380000cb051c40 add ftz rn f32 $r20 mul $r48 $r50 $r28

00000348: 306e0000caa79c40 add ftz rn f32 $r30 mul $r42 $r50 $r55

00000350: 303a0000c7055e40 add ftz rn f32 $r21 neg mul $r48 $r49 $r29[/codebox]

Or would it turn out to be slower?

I’m only getting 950 B) . Still hoping to get to 1 Tflops.

I am fortunate in that the algorithm I’m applying to the GPU has O(N^2) compute but requires only O(N) memory traffic (it averages over many vector outer products). Thus for large N it should always be compute limited, not bandwidth limited; matrix-matrix multiplication is like this too. In my algorithm I’m employing a multi-level tiling strategy, with data reuse between threads in a thread block: each thread block operates on a 32x32 complex tile, and each thread operates on a 4x4 tile, hence a thread block size of 8x8 (a rough sketch of this structure follows the list below). The first level of tiling lets me hide global memory transactions, and the second hides the shared memory overhead (versus registers). I’m actually surprised it’s so fast, as I had thought 64 threads per block wouldn’t give enough warps to fully hide latency. Moving to more threads per block to increase the number of warps isn’t so simple:

  1. I could increase the thread block size while keeping the block tile size fixed. This shrinks the per-thread tile and increases the shared memory overhead, which reduces performance to ~800 Gflops.

  2. Increasing the thread block size and scaling up the block tile size with it isn’t an option either, as this has the knock-on effect of increasing the per-thread tile size, and the lack of registers prevents that.
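To make the structure concrete, here is a rough, untuned sketch of the two-level tiling I’m describing. The names, the one-sample-at-a-time staging and the assumption that N is a multiple of 32 are simplifications for illustration; it is not my actual kernel:

[codebox]// Sketch only: block tile 32x32, per-thread tile 4x4, 8x8 threads per block.
// C (NxN, row major, complex stored as float2) accumulates the outer
// product x(t) * conj(x(t)) over nsamples samples.
#define TILE  32
#define TTILE  4

__global__ void corr_tile(const float2 *x, float2 *C, int N, int nsamples)
{
    __shared__ float2 rowSeg[TILE];   // 32 "row" elements of x(t)
    __shared__ float2 colSeg[TILE];   // 32 "column" elements of x(t)

    const int row0 = blockIdx.y * TILE + threadIdx.y * TTILE;
    const int col0 = blockIdx.x * TILE + threadIdx.x * TTILE;

    // Per-thread 4x4 accumulator tile, kept in registers.
    float sumRe[TTILE][TTILE], sumIm[TTILE][TTILE];
    for (int i = 0; i < TTILE; i++)
        for (int j = 0; j < TTILE; j++)
            sumRe[i][j] = sumIm[i][j] = 0.0f;

    for (int t = 0; t < nsamples; t++) {
        // Stage the two 32-element segments of x(t) that this block needs.
        int idx = threadIdx.y * blockDim.x + threadIdx.x;   // 0..63
        if (idx < TILE) {
            rowSeg[idx] = x[t * N + blockIdx.y * TILE + idx];
            colSeg[idx] = x[t * N + blockIdx.x * TILE + idx];
        }
        __syncthreads();

        // Each thread updates its private 4x4 sub-tile from shared memory.
        #pragma unroll
        for (int i = 0; i < TTILE; i++) {
            #pragma unroll
            for (int j = 0; j < TTILE; j++) {
                float2 a = rowSeg[threadIdx.y * TTILE + i];
                float2 b = colSeg[threadIdx.x * TTILE + j];
                sumRe[i][j] += a.x * b.x + a.y * b.y;   // Re(a * conj(b))
                sumIm[i][j] += a.y * b.x - a.x * b.y;   // the subtract that wants an fms
            }
        }
        __syncthreads();
    }

    // Write back the 4x4 sub-tile.
    for (int i = 0; i < TTILE; i++)
        for (int j = 0; j < TTILE; j++)
            C[(row0 + i) * N + col0 + j] = make_float2(sumRe[i][j], sumIm[i][j]);
}[/codebox]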

Anyway, what I’m saying may have no relevance to your kernels in particular, as my optimizations are very specific to the problem at hand.

Understood, but it seems strange to me that ptx would correctly represent fma, but not fms?

Sounds like an interesting experiment. So I would edit the cubin file in a hex editor, and then link with the host code (compiled as a .o) to form the executable?

FMA is an instruction in its own right: it cannot easily be described using other instructions. But FMS can be described with just two existing instructions (e.g. a negation followed by an FMA).
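As an aside, here is a small self-contained illustration, my own sketch rather than anything from the documentation, of why FMA really is its own instruction: the fused form rounds a*b + c only once, which a separate mul and add cannot reproduce. The __fmul_rn()/__fadd_rn() intrinsics are used because CUDA guarantees they are never contracted into an FMA (device-side printf needs a Fermi-class card):

[codebox]#include <cstdio>

__global__ void fma_vs_mul_add(void)
{
    float a = 1.0f + ldexpf(1.0f, -12);      // 1 + 2^-12
    float b = a;
    float c = -(1.0f + ldexpf(1.0f, -11));   // -(1 + 2^-11)

    // The exact product is 1 + 2^-11 + 2^-24, which does not fit in a float.
    float separate = __fadd_rn(__fmul_rn(a, b), c);  // two roundings -> 0
    float fused    = __fmaf_rn(a, b, c);             // one rounding  -> 2^-24

    printf("separate = %g, fused = %g\n", separate, fused);
}

int main(void)
{
    fma_vs_mul_add<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}[/codebox]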

SM 1.x devices support quite complex instructions, such as:

@p1.leu mad.f32.rn p2|r5, -r1, |s[a3+0x24]|, c[0xc0]

which means:

compute the “less or unordered” relation from the flags stored in p1; then, for all threads where the condition holds: load operand 2 from shared memory with an indirection, load operand 3 from constant memory, take the absolute value of operand 2, negate operand 1, perform a multiplication rounded toward zero and an addition rounded to the nearest, and compute the sign bit and zero bit and store them in the flag register p2…

Should this be encoded as a PTX instruction?

(not sure about the absolute value, but you get the idea…)

The quick-and-dirty way would be to just open the executable in the hex editor, look for the “0xcae59c40 0x303c0000” pattern, and flip those two bits. Should be enough for the experiment.

(although I suspect things may get more interesting with a GTX 460…)