PTX is a virtual architecture (and a compiler intermediate format). It is needed because the actual ISA of each generation of NVIDIA's GPUs is not binary compatible with the previous one, unlike what people may be used to from the x86 world.
Any analysis of the efficiency of generated code has to look at the actual hardware instructions, which you can see by disassembling a binary with cuobjdump --dump-sass, where SASS is the machine language executed by the hardware. If you do that, you will find that many PTX instructions are implemented as sequences of SASS instructions, some short, some lengthy; analyzing GPU code at the PTX level is therefore largely meaningless. NVIDIA does not publish details of the SASS instruction set; however, a summary listing of the available instructions, with one-sentence descriptions, is part of the CUDA documentation.
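As a minimal illustration (the kernel name is mine, for demonstration only): the product below is a single mul.lo.u64 at the PTX level, but since GPUs have no 64-bit integer multiplier, the SASS contains a multi-instruction emulation sequence whose exact shape varies by architecture; cuobjdump --dump-sass makes the difference plain.

```
// Demo: one multiply in source and in PTX, several instructions in SASS.
__global__ void mul64_demo (unsigned long long *r,
                            unsigned long long a, unsigned long long b)
{
    *r = a * b;  // single mul.lo.u64 in PTX; emulation sequence in SASS
}
```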
The translation of PTX to SASS is the job of PTXAS, which is the last of the CUDA compiler stages orchestrated by the compiler driver nvcc. Contrary to what the name may suggest, PTXAS is an optimizing compiler. Even so, it cannot work miracles when stitching together multiple emulation sequences, such as those involved in the twenty or so PTX flavors of 32-bit IMUL/IMAD (of which about eight are needed to implement an efficient long-integer multiply on sm_2x and sm_3x, as you can tell from the code above).
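For the curious, here is a sketch of a few of those flavors exposed through inline PTX (the wrapper names are mine and the selection is illustrative, not exhaustive; the full set also covers signed variants, mul.wide, and the corresponding mad forms):

```
// A few of the 32-bit multiply flavors PTX offers.
__device__ unsigned int my_mul_lo (unsigned int a, unsigned int b)
{
    unsigned int r;
    asm ("mul.lo.u32 %0, %1, %2;" : "=r"(r) : "r"(a), "r"(b));  // low 32 bits of a*b
    return r;
}
__device__ unsigned int my_mul_hi (unsigned int a, unsigned int b)
{
    unsigned int r;
    asm ("mul.hi.u32 %0, %1, %2;" : "=r"(r) : "r"(a), "r"(b));  // high 32 bits of a*b
    return r;
}
__device__ unsigned int my_mad_lo (unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int r;
    asm ("mad.lo.u32 %0, %1, %2, %3;" : "=r"(r) : "r"(a), "r"(b), "r"(c));  // low half of a*b+c
    return r;
}
```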
GPUs are attractive in high performance computing because they offer high performance per device (and, related, performance per square millimeter of silicon), high performance per watt, and high performance per dollar. It is the first of these metrics that causes NVIDIA to re-evaluate the silicon budget with every new architecture, and can lead to the removal of instructions from the hardware, as happened in the switch from IMAD to XMAD. Multipliers are "large" structures as computational units go; the silicon saved by dropping dedicated IMAD units could be devoted to other performance features, while XMAD re-uses the hardware of the single-precision floating-point multipliers.
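To see why 16x16-bit XMAD hardware suffices for 32-bit multiplies, consider this sketch of the decomposition into 16-bit partial products (this mirrors the arithmetic, not the exact XMAD sequence PTXAS emits):

```
// 32-bit multiply from 16x16-bit partial products; the a_hi*b_hi term falls
// entirely outside the low 32 bits and can be dropped.
__device__ unsigned int mul32_from_16bit_partials (unsigned int a, unsigned int b)
{
    unsigned int a_lo = a & 0xffffu, a_hi = a >> 16;
    unsigned int b_lo = b & 0xffffu, b_hi = b >> 16;
    unsigned int lo  = a_lo * b_lo;               // bits  0..31
    unsigned int mid = a_hi * b_lo + a_lo * b_hi; // contributes to bits 16..31
    return lo + (mid << 16);                      // == a * b (mod 2^32)
}
```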
This plays into the second metric: if you look at the Green500 results, you will notice a very sizeable increase in energy efficiency going from Kepler to Maxwell, and again from Maxwell to Pascal. Obviously not all of this was due to microarchitectural changes, but those certainly played a significant role.
With full adaptation of code, the overall throughput of most flavors of individual integer multiplies was preserved across the IMAD/XMAD switch, especially for the common cases. However, code that uses several of the more complicated IMAD flavors, such as integer division or the long-integer multiply code discussed here, takes a bit of a performance hit unless it is re-written in terms of mul.wide.u16, at which point rough parity with previous platforms is restored.
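Here is a sketch of what such a rewrite might look like, building a 32x32->64-bit multiply from mul.wide.u16 partial products (function names are mine; the actual variants posted above may differ):

```
// 16x16 -> 32-bit widening multiply via inline PTX.
__device__ unsigned int mulwide_u16 (unsigned short a, unsigned short b)
{
    unsigned int r;
    asm ("mul.wide.u16 %0, %1, %2;" : "=r"(r) : "h"(a), "h"(b));
    return r;
}

// 32x32 -> 64-bit multiply assembled from four 16-bit partial products.
__device__ unsigned long long umul64_via_u16 (unsigned int a, unsigned int b)
{
    unsigned short a_lo = (unsigned short)a, a_hi = (unsigned short)(a >> 16);
    unsigned short b_lo = (unsigned short)b, b_hi = (unsigned short)(b >> 16);
    unsigned long long p0 = mulwide_u16 (a_lo, b_lo);
    unsigned long long p1 = mulwide_u16 (a_hi, b_lo);
    unsigned long long p2 = mulwide_u16 (a_lo, b_hi);
    unsigned long long p3 = mulwide_u16 (a_hi, b_hi);
    return p0 + ((p1 + p2) << 16) + (p3 << 32);  // exact: result < 2^64
}
```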
You can find tradeoffs of a similar nature between computational units in all major processor families, including the dozen or so architecture generations of x86 processors since the Pentium Pro of 1995. Unlike on GPUs, however, instruction emulation code in x86 processors lives inside the CPU, as microcode, since binary compatibility must be preserved.
Depending on what exactly this computation does, it may well benefit from being rewritten in terms of mul.wide.u16 for sm_50 and later processors. You might want to tackle a sample routine, re-write it, and see what the profiler has to say. If you want to get a rough idea of possible differences in instruction count and register pressure, you could simply build the three variants of mul.wide.u64 that I posted above, compile them for sm_50, and compare them with cuobjdump --dump-sass.
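For such a comparison, something along these lines should do (file and kernel names are placeholders; substitute each variant in turn for the umul64_via_u16 sketch used here):

```
// mul_wide_u64.cu -- hypothetical harness: wrap each variant in a kernel so
// it survives into the cubin, then inspect the SASS. Build and disassemble:
//   nvcc -arch=sm_50 -cubin -o mul_wide_u64.cubin mul_wide_u64.cu
//   cuobjdump --dump-sass mul_wide_u64.cubin
__global__ void test_umul64 (const unsigned int *a, const unsigned int *b,
                             unsigned long long *r, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) r[i] = umul64_via_u16 (a[i], b[i]);  // variant under test
}
```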