Maxwell integer mul/mad instruction counts

In case anyone is interested, you can see Kepler vs. Maxwell integer mul/mad SASS output here.

Have you had a chance to compare the performance? I am under the impression (possibly mistaken) that XMAD on Maxwell executes at the same rate as FFMA, which should make up for the longer emulation sequences. I do not have a Maxwell-based GPU to measure this myself.

I haven’t looked at the performance of the 64-bit types, but for 32-bit types Maxwell is showing a much higher IPC than Kepler.

For an intense IMUL microbenchmark, an sm_35 device reports an IPC of “1.0”. That’s exactly what’s stated in the throughput table in the Programming Guide (32 imuls/clock).

The sm_50 device reports an IPC of ~3.87 and runs a kernel with 3x as many instructions (as expected).

So … my napkin calculation is that Maxwell’s 32-bit integer multiply throughput is actually higher than Kepler’s. Correct me if my logic is wrong, since this result is surprising.

One other odd Maxwell result: a “mul.hi.sat” compiles to a single IMAD.HI.SAT opcode with a very low IPC of 0.03. I was expecting to see an XMAD sequence instead of an IMAD.

Anyway, I was just curious about Maxwell’s XMAD. :)

Edit: the Programming Guide should probably reflect that Maxwell’s integer mul/mad throughput hasn’t regressed in most cases. The words “Multiple Instructions” probably led most of us to conclude that throughput would be significantly less than 32 per clock.

Your results are not surprising, so far.

Multipliers in general are resource-hungry; that is, they take up relatively large silicon real estate. In Kepler, the IMAD units are separate from the FFMA units, and to limit the additional silicon cost it makes sense to limit the throughput of IMAD relative to the throughput of FFMA. In Maxwell, regular IMADs are emulated by XMADs that can share the FFMA hardware and execute at the same rate.

So this is a RISC-like approach that replaces relatively slow complex instructions with relatively fast simple ones. In the common case of a 32-bit multiply with a lower 32-bit result, it appears to be a net win: ~4x instruction throughput and 3x the instruction count combine for a ~1.3x speedup. There is a potential caveat: the number of instructions required to emulate more complex and wider versions of IMAD tends to grow super-linearly, so there may be emulation sequences for which the Maxwell version is slower than the Kepler version. An interesting test case might be 64-bit integer division, which is heavily dependent on wide integer multiplies.

Not sure what is up with IMAD.HI.SAT; I did not even know such an instruction existed!

Side note: when it’s available, the PTX “.sat” modifier has been useful in quick/naive microbenchmarks for blocking the compiler from optimizing away a sequence of ops. But if the “.sat” modifier winds up forcing a hardware slow path, then it’s not a very useful microbenchmark. :)