XMAD meaning

Hi all,

I recently started looking at SASS, the native ISA, on a Pascal GPU (Titan X), and came across the XMAD instruction, which is not described in detail in NVIDIA's CUDA binary utilities document (http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html).

I found out that “mul.lo.u32 %r0, %r0, %5” can always be translated to the following SASS code:

XMAD R8, R5.reuse, R0.reuse, RZ;
XMAD.MRG R10, R5.reuse, R0.H1, RZ;
XMAD.PSL.CBCC R0, R5.H1, R10.H1, R8;

My guess is that XMAD is an integer short multiply-add, but I don’t know what “MRG, PSL, CBCC, H1” mean in the instruction. I tried to use cuda-gdb to step through it, but when I add -G to the nvcc options, the XMAD instructions disappear.

Does anyone have an idea what they mean? Thanks!

NVIDIA does not provide full documentation for each architecture-specific instruction set used by SASS (machine code). You can guess and reverse engineer based on the code produced for known functionality, such as this case of a 32-bit multiply.

The name and number of source operands suggests that XMAD is a multiply-add instruction. The .H1 suggests that it operates on half-register inputs for multiplication (I would guess there is also .H0, but that this is likely the default setting and thus not shown), and the use of full registers for the third source and the destination suggests it uses those during additions. Thus “X” presumably stands for “eXtended”, because two 16-bit factors are multiplied into a 32-bit product which is added to a 32-bit addend, giving a 32-bit result.

The .MRG suffix presumably means “merge”, although I wouldn’t know what the merge semantics are. I have only a wild guess with regard to the .PSL suffix, maybe “product shift left”? The .CBCC possibly has to do with carry handling, this is suggested by the naming of the “CC” part, no clue as to the “CB” part. It might become clearer if you look at a 32-bit mulhi operation or a 64-bit multiplication as both require carry propagation between 32-bit registers.

For a 32-bit multiply we need to add three partial products: r0.lo*r5.lo + ((r0.lo * r5.hi) << 16) + ((r0.hi * r5.lo) << 16) [here: .lo => .H0, .hi =>.H1]; it’s not quite clear how exactly these correspond to the above instructions, e.g. why does the third XMAD use two .hi (= .H1) components?
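To make the partial-product decomposition above concrete, here is a minimal C sketch. It is plain C, not SASS, and the function name mul_lo_from_halves is made up for illustration. It only checks the arithmetic identity and says nothing about how the three XMADs actually split up the work.

```c
#include <stdint.h>

/* Low 32 bits of a 32x32 product, built from 16x16->32 partial products.
   The hi*hi term is omitted because it only affects bits 32 and above. */
uint32_t mul_lo_from_halves(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xffffu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xffffu, b_hi = b >> 16;

    uint32_t low   = a_lo * b_lo;               /* lo * lo, full 32 bits      */
    uint32_t cross = a_lo * b_hi + a_hi * b_lo; /* may wrap mod 2^32; only    */
                                                /* its low 16 bits survive    */
    return low + (cross << 16);
}
```

For any inputs this should agree with the ordinary wrapping product, i.e. (uint32_t)(a * b).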

You can probably find out more from Scott Gray’s Maxwell SASS assembler: https://github.com/NervanaSystems/maxas. Scott spent considerable time reverse-engineering the Maxwell instruction set, including the control / op-steering words inserted for every group of three instructions.

[Later:] Slides 12ff. of this presentation, whose author list includes NVIDIA engineers, show some details of XMAD: http://arith23.gforge.inria.fr/slides/Emmart_Luitjens_Weems_Woolley.pptx

I did file an RFE asking for XMAD to be exposed as a PTX expression. Norbert, that presentation you found is an excellent example of why it could be useful! I am interested in experimenting with its efficiency in several PRNG variants.

I completely agree. The native integer multiplication support in sm_3x was great for multiword arithmetic (see http://stackoverflow.com/questions/6162140/128-bit-integer-on-cuda for a simple example), in that it was easy to write the PTX and one could get full performance doing so. This is particularly important for cryptography.

As the slide presentation correctly states, executing that PTX code written in the Kepler time frame with Maxwell’s XMAD-based emulation of these PTX instructions makes for quite inefficient code, and one really needs direct access to XMAD to be fully efficient once again.

The same authors have written an IEEE paper on the same topic.
It has a very interesting small note not mentioned in the CUDA 8 docs, probably because it’s too low level.

Thanks for the pointer to the paper. I actually downloaded this paper some months back, but apparently completely forgot about this interesting and useful detail about the compiler recognizing XMAD idioms used at PTX level.

I bet you’re right that “PSL” is product shift left, placing the low half of a result into the high half of the accumulator and avoiding the need for a whole extra bitshift (or byte permute). This kind of control is why it’s best for XMAD to become a PTX primitive, since we have no control over whether the compiler can recognize an XMAD in anything more complex than the magic PTX replacement template given in that paper.

I started to play with XMAD this morning using that paper PTX to XMAD snippet, and within two minutes realized I needed this “put result in the high bits” feature.

Yah, finer grain control over XMAD would be pretty useful. In particular it would be nice to be able to multiply a 32 bit value by a 16 bit value in just 2 instructions instead of 3:

XMAD d, a, b, c;
XMAD.PSL d, a.H1, b, d;

I haven’t tried in a while but I’m betting it’s still not possible to generate this in ptx.
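For reference, here is a small C model of what that two-instruction sequence would compute. The helper names are hypothetical, and the semantics are assumptions taken from the guesses in this thread, not from NVIDIA documentation: plain XMAD is a 16x16 product plus a 32-bit addend, and .PSL shifts the product left by 16 before the add.

```c
#include <stdint.h>

/* Assumed model of plain XMAD: (a.H0 * b.H0) + c, wrapping at 32 bits. */
static uint32_t xmad(uint32_t a16, uint32_t b16, uint32_t c)
{
    return (a16 & 0xffffu) * (b16 & 0xffffu) + c;
}

/* Assumed model of XMAD.PSL: ((a.H0 * b.H0) << 16) + c. */
static uint32_t xmad_psl(uint32_t a16, uint32_t b16, uint32_t c)
{
    return (((a16 & 0xffffu) * (b16 & 0xffffu)) << 16) + c;
}

/* a * b + c (mod 2^32) for a 32-bit a and a 16-bit b, in two XMAD steps. */
uint32_t mad_u32_u16(uint32_t a, uint16_t b, uint32_t c)
{
    uint32_t d = xmad(a, b, c);      /* XMAD     d, a,    b, c; */
    return xmad_psl(a >> 16, b, d);  /* XMAD.PSL d, a.H1, b, d; */
}
```

The high half of a only contributes to bits 16 and up, so one shifted partial product is enough when b fits in 16 bits.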

I spent an hour or so trying to coax C or ptx to generate that 2-XMAD u16 x u32 multiply in output SASS.

No success.

Observations: using easy pure CUDA C, you’ll get pure optimal single XMAD 16x16->16 bit multiplies by multiplying unsigned shorts. But that won’t recognize a .h1 high word opportunity when you >>16 the short initializer. I guess that is asking too much for ptxas to recognize, though it DOES recognize .h0 and .h1 opportunities using the IEEE paper u16 local variable assignment incantation.

C code with a 16-bit immediate, like d=a*12345+b, was translated into PTX mad.lo.s32 with an immediate as expected. And, happily, this was converted into the desired XMAD and XMAD.PSL pair in SASS!! So ptxas does know how to properly generate the optimal 2-XMAD u16*u32 when it knows it’s u16*u32.

ptxas acts very straightforwardly translating PTX to SASS, generating an optimal 1, 2, or 3 XMAD result for each ptx mul or mad, given the information the single ptx line conveys. But PTX itself is not descriptive enough to annotate the extra information of “one register argument is 16 bit, one is 32 bit” in mul or mad. ptxas is able to notice the u16 itself when given a 16-bit immediate (and generates optimal 2-XMAD SASS), but not with a u32 register argument that is clearly holding only a u16 value by initialization.

So, Scott, you were correct. We can’t do this in PTX since PTX isn’t descriptive enough and ptxas doesn’t try to value track multiple ptx statements to understand when an argument is 16 bit. The significant flaw in my hypothesis is that ptxas DOES successfully analyze multiple lines to notice the u32->u16 PTX local variables from the IEEE paper method to track .h0 and .h1 opportunities.

Well, that is the problem I see with that sort of hardware design: it all seems oh-so clever and elegant (in a minimalist, one-button Apple mouse sort of way), until you actually have to make the functionality available to software in a non-quirky way to fully utilize it. And for 99% of programmers “software” means an HLL, not diddling around with assembly language.

To be fair, XMAD does the job of reducing energy usage and silicon real estate for the most common case of integer multiplication, the 32x32->32 bit multiply.

Thought I’d give some input for the possible XMAD modifiers.
The following show what a * b + c is converted into with the modifiers:

  • PSL (product shift left, as stated earlier): ((a * b) << 16) + c
  • MRG (move register): ((a * b + c) & 0xffff) + (b << 16)
  • CBCC: a * b + c + (b << 16)
  • CHI (constant high): a * b + (c >> 16)
  • CLO (constant low): a * b + (c & 0xffff)
  • In addition, both the a and b registers can have H1 applied to them, which selects the high 16 bits to use in the XMAD instead. However, both MRG and CBCC ignore the H1 and will apply the shifted lower 16 bits of b regardless.
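The list above can be written as a small C model, which makes it easier to experiment with. This is only a sketch of the semantics as listed in this post, not anything from NVIDIA. The names xmad_model, half, and the XMAD_* flags are made up, and how flags combine (for example the order in which CHI/CLO, CBCC and MRG are applied) is my assumption.

```c
#include <stdint.h>

/* Hypothetical modifier flags for the model below. */
enum { XMAD_PSL = 1, XMAD_MRG = 2, XMAD_CBCC = 4, XMAD_CHI = 8, XMAD_CLO = 16 };

/* Select a 16-bit half of a register: H0 (low) by default, H1 (high) if h1 != 0. */
static uint32_t half(uint32_t r, int h1)
{
    return h1 ? r >> 16 : r & 0xffffu;
}

/* Model of "XMAD[.mods] d, a[.H1], b[.H1], c" using the semantics listed above. */
uint32_t xmad_model(uint32_t a, int a_h1, uint32_t b, int b_h1,
                    uint32_t c, unsigned mods)
{
    uint32_t prod = half(a, a_h1) * half(b, b_h1);

    if (mods & XMAD_PSL)  prod <<= 16;      /* shift product left            */
    if (mods & XMAD_CHI)  c >>= 16;         /* use high half of c            */
    if (mods & XMAD_CLO)  c &= 0xffffu;     /* use low half of c             */
    if (mods & XMAD_CBCC) c += b << 16;     /* always the LOW half of b      */

    uint32_t sum = prod + c;

    if (mods & XMAD_MRG)                    /* low 16 bits of sum, with the  */
        sum = (sum & 0xffffu) | (b << 16);  /* low half of b placed on top   */
    return sum;
}
```

Note that b << 16 on a 32-bit value keeps only the low 16 bits of b, which matches the observation that MRG and CBCC ignore H1. For MRG the two fields do not overlap, so | and + give the same result.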

    Thanks for puzzling that out.

    I would think that MRG stands for “merge”: It takes bits [15:0] from (a * b + c), while taking bits [31:16] from (b << 16). So really: ((a * b + c) & 0xffff) | (b << 16), since the fields don’t overlap.

    The ‘C’ in CHI and CLO presumably simply refers to the 32-bit C operand. I am still puzzled by the CBCC mode. “CB” because it adds both C and B? Also note that the code in the original post uses XMAD.PSL.CBCC, so it looks like the .CBCC suffix is orthogonal while according to the listed semantics they would be mutually exclusive. More research seems needed.

    You’re probably right about the naming schemes and yes the ‘+’ can be substituted for a ‘|’ in my MRG definition. However, PSL and CBCC are not mutually exclusive. Using MRG instead of CBCC would render the PSL useless because MRG overwrites the higher bits while CBCC adds to them.

    Here’s what this computes:
    XMAD.PSL.CBCC R0, R1, R2, R3;
    R0 := ((R1 * R2) << 16) + R3 + (R2 << 16)

    Thanks for the clarification. So the modifiers

    .PSL => shift product left prior to 32-bit addition
    .CBCC => add (b << 16) to c prior to 32-bit addition

    In other words these are directly used as select signal for separate muxes in the XMAD datapath, with .MRG driving yet another mux. Makes sense.
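With those semantics, the three-instruction sequence from the original post can be traced in C. This is a sketch under the assumptions stated in this thread (MRG and CBCC always take the low half of b, regardless of .H1). The function and helper names are invented for illustration.

```c
#include <stdint.h>

static uint32_t lo16(uint32_t r) { return r & 0xffffu; }
static uint32_t hi16(uint32_t r) { return r >> 16; }

/* Emulates the SASS sequence for mul.lo.u32 from the original post. */
uint32_t mul_lo_via_xmad(uint32_t r5, uint32_t r0)
{
    /* XMAD R8, R5, R0, RZ;
       R8 = R5.lo * R0.lo */
    uint32_t r8 = lo16(r5) * lo16(r0);

    /* XMAD.MRG R10, R5, R0.H1, RZ;
       R10.lo = low 16 bits of (R5.lo * R0.hi), R10.hi = R0.lo */
    uint32_t r10 = ((lo16(r5) * hi16(r0)) & 0xffffu) | (r0 << 16);

    /* XMAD.PSL.CBCC R0, R5.H1, R10.H1, R8;
       product: R5.hi * R10.hi = R5.hi * R0.lo, shifted left by 16
       CBCC:    adds R10.lo << 16 = (R5.lo * R0.hi) << 16 */
    return ((hi16(r5) * hi16(r10)) << 16) + r8 + (r10 << 16);
}
```

So under this reading, the third XMAD uses two .H1 operands because R10.H1 is really R0's low half, parked there by the MRG. The two cross products and the lo*lo product are all accounted for, and the result should match (uint32_t)(r5 * r0).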

    ladberg, thanks for the backengineering! I spent a while trying to puzzle them out and didn’t get that far.

    I had guessed the CC in CBCC had something to do with setting the carry flag.

    A few months back I put in an RFE to expose XMAD to PTX. It was later marked “will not fix” with no discussion. I suspect it was rejected because of the combinatorial explosion of all the XMAD suffixes and .h0 .h1 options creating a lot of new PTX grammar, and I don’t blame them. Or (totally speculating) perhaps because XMAD may be replaced with a different integer multiply schema with later hardware and if XMAD were in PTX, they’d have a lot of annoying emulation code to write (much like the work needed to emulate Kepler’s video/SIMD on Maxwell/Pascal).

    I am not surprised that the RFE was shot down, and I think it is for the first reason stated. As I said before, XMAD is a great example of how to design “elegant” hardware without keeping the needs of software in mind.

    I very much doubt that NVIDIA will change future GPU architectures back to include high-throughput 32-bit multiplier capability. The 10,000ft view is that XMAD addresses 95% of use cases via reasonable emulation and that the hardware savings from moving to XMAD are probably very attractive. As far as computational circuits go, multipliers are big and power hungry.

    NVIDIA has implemented the integer multiply hardware in three (or more!) very different ways in various generations. Tesla had an integer mul24() reusing the 24 bit multiplier from the FP32 unit, and mul32 was built up from two of those calls and some shift/adds. Then Fermi had a full integer 32 bit multiplier (possibly FP32 reused part of this for itself as well). Then Maxwell/Pascal have a 16 bit MADD multiplier (perhaps reused from the FP32 unit, but if so, why not expose 24 bit multiplies?) But GP100 is crazy since it has a 24 bit multiplier (for fp32), likely a 16 bit MADD multiplier (for XMAD), and then TWO 11-bit multipliers (for fp16x2 SIMD half precision). And Kepler was unique too, with its SIMD integer instructions.

    Oh, and I forgot Pascal’s DP2A and DP4A instructions, which means there must be even more complexities to allow four 8x8 or two 8x16 multiplies… again, reusing some of the 16 bit multiplier or full extra multiplier unit?

    There’s also an argument for exposing the 52 bit integer multiplier from the DP units. Intel has just done that for AVX-512. Cryptography and number theory applications are always eager for giant integer multipliers and it’d be an extra selling point for GV100 Teslas for niche users.

    (Edit: GP100, not GF100, thanks njuffa!)

    I think you meant “GP100 is crazy …” ?

    I vaguely recall that Sun’s SuperSPARC re-used the DP multiplier for integer multiplies. I would have to pull out my first thesis to confirm, where I tuned a crypto library for various SPARC platforms.

    [Later:] I mis-remembered. It was the integer division instruction ‘udiv’ that was limited to dividends <= 2**52 on the SuperSPARC because it re-used the floating-point divider.