PTX is a virtual architecture (and a compiler intermediate format). It is needed because the actual ISA of each generation of NVIDIA's GPUs is not binary compatible with the previous one, unlike what people may be used to from the x86 world.
Any analysis of the efficiency of generated code has to look at the actual hardware instructions, which you can see by disassembling a binary with cuobjdump --dump-sass, where SASS is the machine language executed by the hardware. If you do that, you will find that many PTX instructions are implemented as sequences of SASS instructions, some short, some lengthy; analyzing GPU code at the PTX level is therefore largely meaningless. NVIDIA does not publish details of the SASS instruction set; however, a summary listing of the available instructions, with one-sentence descriptions, is part of the CUDA documentation.
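As a minimal illustration (the kernel name is mine, for demonstration only): the product below is a single mul.lo.u64 at the PTX level, but since GPUs have no 64-bit integer multiplier, the SASS contains a multi-instruction emulation sequence whose exact shape varies by architecture; cuobjdump --dump-sass makes the difference plain.

```
// Demo: one multiply in source and in PTX, several instructions in SASS.
__global__ void mul64_demo (unsigned long long *r,
                            unsigned long long a, unsigned long long b)
{
    *r = a * b;  // single mul.lo.u64 in PTX; emulation sequence in SASS
}
```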
The translation of PTX to SASS is the job of PTXAS, which is the last of the CUDA compiler stages orchestrated by the compiler driver nvcc. Contrary to what the name may suggest, PTXAS is an optimizing compiler. Even so, it cannot work miracles when stitching together multiple emulation sequences, such as those involved in the twenty or so PTX flavors of 32-bit IMUL/IMAD (of which about eight are needed to implement an efficient long-integer multiply on sm_2x and sm_3x, as you can tell from the code above).
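For the curious, here is a sketch of a few of those flavors exposed through inline PTX (the wrapper names are mine and the selection is illustrative, not exhaustive; the full set also covers signed variants, mul.wide, and the corresponding mad forms):

```
// A few of the 32-bit multiply flavors PTX offers.
__device__ unsigned int my_mul_lo (unsigned int a, unsigned int b)
{
    unsigned int r;
    asm ("mul.lo.u32 %0, %1, %2;" : "=r"(r) : "r"(a), "r"(b));  // low 32 bits of a*b
    return r;
}
__device__ unsigned int my_mul_hi (unsigned int a, unsigned int b)
{
    unsigned int r;
    asm ("mul.hi.u32 %0, %1, %2;" : "=r"(r) : "r"(a), "r"(b));  // high 32 bits of a*b
    return r;
}
__device__ unsigned int my_mad_lo (unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int r;
    asm ("mad.lo.u32 %0, %1, %2, %3;" : "=r"(r) : "r"(a), "r"(b), "r"(c));  // low half of a*b+c
    return r;
}
```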
GPUs are attractive in high performance computing because they offer high performance per device (and, related, performance per square millimeter of silicon), high performance per watt, and high performance per dollar. It is the first of these metrics that causes NVIDIA to re-evaluate the silicon budget with every new architecture, and can lead to the removal of instructions from the hardware, as happened in the switch from IMAD to XMAD. Multipliers are "large" structures as computational units go; the silicon saved by dropping dedicated IMAD units could be devoted to other performance features, while XMAD re-uses the hardware of the single-precision floating-point multipliers.
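To see why 16x16-bit XMAD hardware suffices for 32-bit multiplies, consider this sketch of the decomposition into 16-bit partial products (this mirrors the arithmetic, not the exact XMAD sequence PTXAS emits):

```
// 32-bit multiply from 16x16-bit partial products; the a_hi*b_hi term falls
// entirely outside the low 32 bits and can be dropped.
__device__ unsigned int mul32_from_16bit_partials (unsigned int a, unsigned int b)
{
    unsigned int a_lo = a & 0xffffu, a_hi = a >> 16;
    unsigned int b_lo = b & 0xffffu, b_hi = b >> 16;
    unsigned int lo  = a_lo * b_lo;               // bits  0..31
    unsigned int mid = a_hi * b_lo + a_lo * b_hi; // contributes to bits 16..31
    return lo + (mid << 16);                      // == a * b (mod 2^32)
}
```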
This plays into the second metric: if you look at the Green500 results, you will notice a very sizeable increase in energy efficiency going from Kepler to Maxwell, and again from Maxwell to Pascal. Obviously not all of this was due to microarchitectural changes, but those certainly played a significant role.
With full adaptation of code, the overall throughput of most flavors of individual integer multiplies was preserved across the IMAD/XMAD switch, especially for the common cases. However, code that uses several of the more complicated IMAD flavors, such as integer division or the long-integer multiply code discussed here, takes a bit of a performance hit unless it is re-written in terms of mul.wide.u16, at which point rough parity with previous platforms is restored.
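Here is a sketch of what such a rewrite might look like, building a 32x32->64-bit multiply from mul.wide.u16 partial products (function names are mine; the actual variants posted above may differ):

```
// 16x16 -> 32-bit widening multiply via inline PTX.
__device__ unsigned int mulwide_u16 (unsigned short a, unsigned short b)
{
    unsigned int r;
    asm ("mul.wide.u16 %0, %1, %2;" : "=r"(r) : "h"(a), "h"(b));
    return r;
}

// 32x32 -> 64-bit multiply assembled from four 16-bit partial products.
__device__ unsigned long long umul64_via_u16 (unsigned int a, unsigned int b)
{
    unsigned short a_lo = (unsigned short)a, a_hi = (unsigned short)(a >> 16);
    unsigned short b_lo = (unsigned short)b, b_hi = (unsigned short)(b >> 16);
    unsigned long long p0 = mulwide_u16 (a_lo, b_lo);
    unsigned long long p1 = mulwide_u16 (a_hi, b_lo);
    unsigned long long p2 = mulwide_u16 (a_lo, b_hi);
    unsigned long long p3 = mulwide_u16 (a_hi, b_hi);
    return p0 + ((p1 + p2) << 16) + (p3 << 32);  // exact: result < 2^64
}
```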
You can find tradeoffs of a similar nature between computational units in all major processor families, including the dozen or so architecture generations of x86 processors since the Pentium Pro of 1995. Unlike on GPUs, however, instruction emulation code in x86 processors lives inside the CPU, as microcode, since binary compatibility must be preserved.
Depending on what exactly this computation does, it may well benefit from being rewritten in terms of mul.wide.u16 for sm_50 and later processors. You might want to tackle a sample routine, re-write it, and see what the profiler has to say. If you want to get a rough idea of possible differences in instruction count and register pressure, you could simply build the three variants of mul.wide.u64 that I posted above, compile them for sm_50, and compare them with cuobjdump --dump-sass.
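For such a comparison, something along these lines should do (file and kernel names are placeholders; substitute each variant in turn for the umul64_via_u16 sketch used here):

```
// mul_wide_u64.cu -- hypothetical harness: wrap each variant in a kernel so
// it survives into the cubin, then inspect the SASS. Build and disassemble:
//   nvcc -arch=sm_50 -cubin -o mul_wide_u64.cubin mul_wide_u64.cu
//   cuobjdump --dump-sass mul_wide_u64.cubin
__global__ void test_umul64 (const unsigned int *a, const unsigned int *b,
                             unsigned long long *r, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) r[i] = umul64_via_u16 (a[i], b[i]);  // variant under test
}
```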