Generating XMAD{.X,.CC} by PTX

Hi everyone!

I am optimizing an application that is almost completely relying on multi-precision integer arithmetic. The main target of the optimization is a Tesla P100. Due to the lack of native 32-bit multiply-and-add instructions, i am trying to speed up the routines with the help of 16-bit XMAD, as explained in the forum (https://devtalk.nvidia.com/default/topic/1017754/long-integer-multiplication-mul-wide-u64-and-mul-wide-u128/) and also in the paper by Emmert et al. (https://ieeexplore.ieee.org/abstract/document/7563271).

To make ptxas generate XMAD instructions, i am using their template, e.g.

#define xmadhh_c_cc(r, a, b, c)                                    \
	asm volatile ("{ 					   \n\t"  \
		".reg .u16		%alo, %ahi, %blo, %bhi;    \n\t"  \
		".reg .u32		%t;                        \n\t"  \
		"mov.b32		{%alo, %ahi}, %1;	   \n\t"  \
		"mov.b32		{%blo, %bhi}, %2;	   \n\t"  \
		"mul.wide.u16		%t, %ahi, %bhi;		   \n\t"  \
		"addc.cc.u32		%0, %3, %t;		   \n\t"  \
	"}"	: "=r"(r) : "r" (a), "r" (b), "r" (c));

However, if i look into the generated SASS code for my platform, i see that for each variant (low/low, high/high, high/low, with and without carry) there is one XMAD and one IADD instruction generated. I never see any XMAD.X or XMAD.CC in the SASS code.

Can anyone with more insight into the compile-steps help me on this one?

Any help is much appreciated!

UPDATE:
In case this is important, i am using

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

I hacked together a MWE:

#include <stdint.h>
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>

#define xmadhh_c_cc(r, a, b, c)                                    \
        asm volatile ("{                                           \n\t"  \
                ".reg .u16              %alo, %ahi, %blo, %bhi;    \n\t"  \
                ".reg .u32              %t;                        \n\t"  \
                "mov.b32                {%alo, %ahi}, %1;          \n\t"  \
                "mov.b32                {%blo, %bhi}, %2;          \n\t"  \
                "mul.wide.u16           %t, %ahi, %bhi;            \n\t"  \
                "addc.cc.u32            %0, %3, %t;                \n\t"  \
        "}"     : "=r"(r) : "r" (a), "r" (b), "r" (c));

__global__
void xmad(uint32_t r, uint32_t a, uint32_t b){
        xmadhh_c_cc(r, a, b, r);
        printf("%x\n", r);
}

int main(){
        xmad<<<1,1>>>(rand(), rand(), rand());
        cudaDeviceSynchronize();
}

And here is the SASS code generated via nvcc -arch=sm_60:

Fatbin elf code:                                                                                                                                                                                                                               
================                                                                                                                                                                                                                               
arch = sm_60                                                                                                                                                                                                                                   
code version = [1,7]                                                                                                                                                                                                                           
producer = cuda                                                                                                                                                                                                                                
host = linux                                                                                                                                                                                                                                   
compile_size = 64bit                                                                                                                                                                                                                           
                                                                                                                                                                                                                                               
        code for sm_60                                                                                                                                                                                                                         
                Function : _Z4xmadjjj                                                                                                                                                                                                          
        .headerflags    @"EF_CUDA_SM60 EF_CUDA_PTX_SM(EF_CUDA_SM60)"                                                                                                                                                                           
                                                                      /* 0x001c4400fe0007f6 */                                                                                                                                                 
        /*0008*/                   MOV R1, c[0x0][0x20] ;             /* 0x4c98078000870001 */                                                                                                                                                 
        /*0010*/         {         IADD32I R1, R1, -0x8 ;             /* 0x1c0fffffff870101 */                                                                                                                                                 
        /*0018*/                   LDC.U16 R0, c[0x0][0x14a]         }                                                                                                                                                                         
                                                                      /* 0xef92000014a7ff00 */                                                                                                                                                 
                                                                      /* 0x001ffc00e62007f0 */                                                                                                                                                 
        /*0028*/         {         MOV32I R4, 0x0 ;                   /* 0x010000000007f004 */                                                                                                                                                 
        /*0030*/                   LDC.U16 R2, c[0x0][0x146]         }                                                                                                                                                                         
                                                                      /* 0xef9200001467ff02 */                                                                                                                                                 
        /*0038*/                   MOV32I R5, 0x0 ;                   /* 0x010000000007f005 */                                                                                                                                                 
                                                                      /* 0x001fd802fec00ff1 */                                                                                                                                                 
        /*0048*/                   LOP32I.AND R0, R0, 0xffff ;        /* 0x0400000ffff70000 */                                                                                                                                                 
        /*0050*/                   LOP32I.AND R2, R2, 0xffff ;        /* 0x0400000ffff70202 */                                                                                                                                                 
        /*0058*/                   XMAD R0, R2, R0, RZ ;              /* 0x5b007f8000070200 */
                                                                      /* 0x0003d800fe0007e2 */
        /*0068*/                   IADD.X R0, R0, c[0x0][0x140] ;     /* 0x4c10080005070000 */
        /*0070*/         {         IADD R6.CC, R1, c[0x0][0x4] ;      /* 0x4c10800000170106 */
        /*0078*/                   STL [R1], R0         }
                                                                      /* 0xef54000000070100 */
                                                                      /* 0x001ffc01ffa007f6 */
        /*0088*/                   IADD.X R7, RZ, c[0x0][0x104] ;     /* 0x4c1008000417ff07 */
        /*0090*/                   JCAL 0x0 ;                         /* 0xe220000000000040 */
        /*0098*/                   EXIT ;                             /* 0xe30000000007000f */
                                                                      /* 0x001f8000fc0007ff */
        /*00a8*/                   BRA 0xa0 ;                         /* 0xe2400fffff07000f */
        /*00b0*/                   NOP;                               /* 0x50b0000000070f00 */
        /*00b8*/                   NOP;                               /* 0x50b0000000070f00 */
                .....................

If I understand this correctly, then /0058/ and /0068/ are the relevant XMAD and IADD instructions for the actual “behaviour”. I would expect a single XMAD.X instruction here.

If anyone can give me any counter-example that in fact does generate the XMAD.{CC,X} instruction on sm_60, i’d be glad already.

I compiled my (complete, not MWE) software again with CUDA 8 and surprisingly XMAD.X CC is inserted everywhere i expected it.

However, the code runs significantly at slower (at ~70% of the CUDA10 binary), which again is surprising…

Running all of this through the profiler shows me that i traded no_instruction stalls (CUDA 10) against wait stalls in CUDA 8.

Hi Jowlo,

I took a look at your example. Due to some specifics of the hardware, only certain instructions can be carry chained to XMAD. Basically XMAD can chain to XMAD or IADD3 can chain to XMAD.

So here’s an example that shows the correct carry chaining. I run two xmads, the first generates the carry, the second consumes it. However, I see that the compiler isn’t quite generating the optimal code – it’s pulling the IADD out of the XMAD for some reason. I’ll file a bug, but I wouldn’t hold your breath for a fix.

Niall

#include <stdint.h>
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>

#define xmadhh_cc(r, a, b, c)                                    \
        asm volatile ("{                                           \n\t"  \
                ".reg .u16              %alo, %ahi, %blo, %bhi;    \n\t"  \
                ".reg .u32              %t;                        \n\t"  \
                "mov.b32                {%alo, %ahi}, %1;          \n\t"  \
                "mov.b32                {%blo, %bhi}, %2;          \n\t"  \
                "mul.wide.u16           %t, %ahi, %bhi;            \n\t"  \
                "add.cc.u32             %0, %3, %t;                \n\t"  \
        "}"     : "=r"(r) : "r" (a), "r" (b), "r" (c));


#define xmadhh_c_cc(r, a, b, c)                                    \
        asm volatile ("{                                           \n\t"  \
                ".reg .u16              %alo, %ahi, %blo, %bhi;    \n\t"  \
                ".reg .u32              %t;                        \n\t"  \
                "mov.b32                {%alo, %ahi}, %1;          \n\t"  \
                "mov.b32                {%blo, %bhi}, %2;          \n\t"  \
                "mul.wide.u16           %t, %ahi, %bhi;            \n\t"  \
                "addc.cc.u32            %0, %3, %t;                \n\t"  \
        "}"     : "=r"(r) : "r" (a), "r" (b), "r" (c));

__managed__ uint32_t a0;
__managed__ uint32_t a1;

__managed__ uint32_t b0;
__managed__ uint32_t b1;

__managed__ uint32_t c0;
__managed__ uint32_t c1;

__managed__ uint32_t r0;
__managed__ uint32_t r1;

__global__
void xmad() {
        xmadhh_cc(r0, a0, b0, c0);
        xmadhh_c_cc(r1, a1, b1, c1);
}

int main(){
        a0=rand();
        b0=rand();
        c0=rand();
        a1=rand();
        b1=rand();
        c1=rand();
        xmad<<<1,1>>>();
        cudaDeviceSynchronize();
}

Hi Niall,

thank you for looking into this and for filing that bug. And thank you for the nice paper exploiting the 16-bit multipliers i mentioned in the OP!

I can imagine with current architectures moving back to 32-bit int multipliers, development on XMAD transformations is probably not a priority.

I guess my MWE was a bit too minimal to show the behaviour, thanks for fixing that. I did try to compile your example with CUDA 8, 9 and 10, yielding XMAD.X for CUDA 8, but XMAD followed by an IADD for 9 and 10.

However, for the application i am developing, a lot of optimization seems to have happened elsewhere between CUDA 8 and 10: My code runs faster with a (32-bit) mad.u32.{lo,hi} version (using XMAD{.MRG,.PSL,.CBCC} and IADD{.X}) compiled with CUDA 10, than with the 16-bit multiplier optimized version compiled with CUDA 8 (using XMAD{.X,.CC}s and less shifting).