Generating XMAD{.X,.CC} by PTX

jowlo · February 11, 2019, 9:51am

Hi everyone!

I am optimizing an application that is almost completely relying on multi-precision integer arithmetic. The main target of the optimization is a Tesla P100. Due to the lack of native 32-bit multiply-and-add instructions, i am trying to speed up the routines with the help of 16-bit XMAD, as explained in the forum (https://devtalk.nvidia.com/default/topic/1017754/long-integer-multiplication-mul-wide-u64-and-mul-wide-u128/) and also in the paper by Emmert et al. (https://ieeexplore.ieee.org/abstract/document/7563271).

To make ptxas generate XMAD instructions, i am using their template, e.g.

#define xmadhh_c_cc(r, a, b, c)                                    \
	asm volatile ("{ 					   \n\t"  \
		".reg .u16		%alo, %ahi, %blo, %bhi;    \n\t"  \
		".reg .u32		%t;                        \n\t"  \
		"mov.b32		{%alo, %ahi}, %1;	   \n\t"  \
		"mov.b32		{%blo, %bhi}, %2;	   \n\t"  \
		"mul.wide.u16		%t, %ahi, %bhi;		   \n\t"  \
		"addc.cc.u32		%0, %3, %t;		   \n\t"  \
	"}"	: "=r"(r) : "r" (a), "r" (b), "r" (c));

However, if i look into the generated SASS code for my platform, i see that for each variant (low/low, high/high, high/low, with and without carry) there is one XMAD and one IADD instruction generated. I never see any XMAD.X or XMAD.CC in the SASS code.

Can anyone with more insight into the compile-steps help me on this one?

Any help is much appreciated!

UPDATE:
In case this is important, i am using

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

jowlo · February 11, 2019, 3:33pm

I hacked together a MWE:

#include <stdint.h>
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>

#define xmadhh_c_cc(r, a, b, c)                                    \
        asm volatile ("{                                           \n\t"  \
                ".reg .u16              %alo, %ahi, %blo, %bhi;    \n\t"  \
                ".reg .u32              %t;                        \n\t"  \
                "mov.b32                {%alo, %ahi}, %1;          \n\t"  \
                "mov.b32                {%blo, %bhi}, %2;          \n\t"  \
                "mul.wide.u16           %t, %ahi, %bhi;            \n\t"  \
                "addc.cc.u32            %0, %3, %t;                \n\t"  \
        "}"     : "=r"(r) : "r" (a), "r" (b), "r" (c));

__global__
void xmad(uint32_t r, uint32_t a, uint32_t b){
        xmadhh_c_cc(r, a, b, r);
        printf("%x\n", r);
}

int main(){
        xmad<<<1,1>>>(rand(), rand(), rand());
        cudaDeviceSynchronize();
}

And here is the SASS code generated via nvcc -arch=sm_60:

Fatbin elf code:                                                                                                                                                                                                                               
================                                                                                                                                                                                                                               
arch = sm_60                                                                                                                                                                                                                                   
code version = [1,7]                                                                                                                                                                                                                           
producer = cuda                                                                                                                                                                                                                                
host = linux                                                                                                                                                                                                                                   
compile_size = 64bit                                                                                                                                                                                                                           
                                                                                                                                                                                                                                               
        code for sm_60                                                                                                                                                                                                                         
                Function : _Z4xmadjjj                                                                                                                                                                                                          
        .headerflags    @"EF_CUDA_SM60 EF_CUDA_PTX_SM(EF_CUDA_SM60)"                                                                                                                                                                           
                                                                      /* 0x001c4400fe0007f6 */                                                                                                                                                 
        /*0008*/                   MOV R1, c[0x0][0x20] ;             /* 0x4c98078000870001 */                                                                                                                                                 
        /*0010*/         {         IADD32I R1, R1, -0x8 ;             /* 0x1c0fffffff870101 */                                                                                                                                                 
        /*0018*/                   LDC.U16 R0, c[0x0][0x14a]         }                                                                                                                                                                         
                                                                      /* 0xef92000014a7ff00 */                                                                                                                                                 
                                                                      /* 0x001ffc00e62007f0 */                                                                                                                                                 
        /*0028*/         {         MOV32I R4, 0x0 ;                   /* 0x010000000007f004 */                                                                                                                                                 
        /*0030*/                   LDC.U16 R2, c[0x0][0x146]         }                                                                                                                                                                         
                                                                      /* 0xef9200001467ff02 */                                                                                                                                                 
        /*0038*/                   MOV32I R5, 0x0 ;                   /* 0x010000000007f005 */                                                                                                                                                 
                                                                      /* 0x001fd802fec00ff1 */                                                                                                                                                 
        /*0048*/                   LOP32I.AND R0, R0, 0xffff ;        /* 0x0400000ffff70000 */                                                                                                                                                 
        /*0050*/                   LOP32I.AND R2, R2, 0xffff ;        /* 0x0400000ffff70202 */                                                                                                                                                 
        /*0058*/                   XMAD R0, R2, R0, RZ ;              /* 0x5b007f8000070200 */
                                                                      /* 0x0003d800fe0007e2 */
        /*0068*/                   IADD.X R0, R0, c[0x0][0x140] ;     /* 0x4c10080005070000 */
        /*0070*/         {         IADD R6.CC, R1, c[0x0][0x4] ;      /* 0x4c10800000170106 */
        /*0078*/                   STL [R1], R0         }
                                                                      /* 0xef54000000070100 */
                                                                      /* 0x001ffc01ffa007f6 */
        /*0088*/                   IADD.X R7, RZ, c[0x0][0x104] ;     /* 0x4c1008000417ff07 */
        /*0090*/                   JCAL 0x0 ;                         /* 0xe220000000000040 */
        /*0098*/                   EXIT ;                             /* 0xe30000000007000f */
                                                                      /* 0x001f8000fc0007ff */
        /*00a8*/                   BRA 0xa0 ;                         /* 0xe2400fffff07000f */
        /*00b0*/                   NOP;                               /* 0x50b0000000070f00 */
        /*00b8*/                   NOP;                               /* 0x50b0000000070f00 */
                .....................

If I understand this correctly, then /0058/ and /0068/ are the relevant XMAD and IADD instructions for the actual “behaviour”. I would expect a single XMAD.X instruction here.

If anyone can give me any counter-example that in fact does generate the XMAD.{CC,X} instruction on sm_60, i’d be glad already.

jowlo · February 11, 2019, 4:45pm

I compiled my (complete, not MWE) software again with CUDA 8 and surprisingly XMAD.X CC is inserted everywhere i expected it.

However, the code runs significantly at slower (at ~70% of the CUDA10 binary), which again is surprising…

Running all of this through the profiler shows me that i traded no_instruction stalls (CUDA 10) against wait stalls in CUDA 8.

nemmart · February 12, 2019, 4:34pm

Hi Jowlo,

I took a look at your example. Due to some specifics of the hardware, only certain instructions can be carry chained to XMAD. Basically XMAD can chain to XMAD or IADD3 can chain to XMAD.

So here’s an example that shows the correct carry chaining. I run two xmads, the first generates the carry, the second consumes it. However, I see that the compiler isn’t quite generating the optimal code – it’s pulling the IADD out of the XMAD for some reason. I’ll file a bug, but I wouldn’t hold your breath for a fix.

Niall

#include <stdint.h>
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>

#define xmadhh_cc(r, a, b, c)                                    \
        asm volatile ("{                                           \n\t"  \
                ".reg .u16              %alo, %ahi, %blo, %bhi;    \n\t"  \
                ".reg .u32              %t;                        \n\t"  \
                "mov.b32                {%alo, %ahi}, %1;          \n\t"  \
                "mov.b32                {%blo, %bhi}, %2;          \n\t"  \
                "mul.wide.u16           %t, %ahi, %bhi;            \n\t"  \
                "add.cc.u32             %0, %3, %t;                \n\t"  \
        "}"     : "=r"(r) : "r" (a), "r" (b), "r" (c));


#define xmadhh_c_cc(r, a, b, c)                                    \
        asm volatile ("{                                           \n\t"  \
                ".reg .u16              %alo, %ahi, %blo, %bhi;    \n\t"  \
                ".reg .u32              %t;                        \n\t"  \
                "mov.b32                {%alo, %ahi}, %1;          \n\t"  \
                "mov.b32                {%blo, %bhi}, %2;          \n\t"  \
                "mul.wide.u16           %t, %ahi, %bhi;            \n\t"  \
                "addc.cc.u32            %0, %3, %t;                \n\t"  \
        "}"     : "=r"(r) : "r" (a), "r" (b), "r" (c));

__managed__ uint32_t a0;
__managed__ uint32_t a1;

__managed__ uint32_t b0;
__managed__ uint32_t b1;

__managed__ uint32_t c0;
__managed__ uint32_t c1;

__managed__ uint32_t r0;
__managed__ uint32_t r1;

__global__
void xmad() {
        xmadhh_cc(r0, a0, b0, c0);
        xmadhh_c_cc(r1, a1, b1, c1);
}

int main(){
        a0=rand();
        b0=rand();
        c0=rand();
        a1=rand();
        b1=rand();
        c1=rand();
        xmad<<<1,1>>>();
        cudaDeviceSynchronize();
}

jowlo · February 12, 2019, 5:06pm

Hi Niall,

thank you for looking into this and for filing that bug. And thank you for the nice paper exploiting the 16-bit multipliers i mentioned in the OP!

I can imagine with current architectures moving back to 32-bit int multipliers, development on XMAD transformations is probably not a priority.

I guess my MWE was a bit too minimal to show the behaviour, thanks for fixing that. I did try to compile your example with CUDA 8, 9 and 10, yielding XMAD.X for CUDA 8, but XMAD followed by an IADD for 9 and 10.

However, for the application i am developing, a lot of optimization seems to have happened elsewhere between CUDA 8 and 10: My code runs faster with a (32-bit) mad.u32.{lo,hi} version (using XMAD{.MRG,.PSL,.CBCC} and IADD{.X}) compiled with CUDA 10, than with the 16-bit multiplier optimized version compiled with CUDA 8 (using XMAD{.X,.CC}s and less shifting).

Topic		Replies	Views
XMAD meaning CUDA Programming and Performance	17	6952	April 10, 2017
Integer MAD instruction CUDA Programming and Performance	11	17669	October 22, 2010
Output code uses far too many registers CUDA Programming and Performance	8	1603	February 5, 2017
long-integer multiplication: mul.wide.u64 and mul.wide.u128 CUDA Programming and Performance	31	7671	January 2, 2018
CUDA low-level programming - strange ptxas behavior CUDA Programming and Performance	4	1477	February 17, 2014
performance difference for cuda between experiments and the documentation for float/double data type... CUDA Programming and Performance	8	1903	October 28, 2016
"no instruction" stalls every 256 bytes of the binary code CUDA Programming and Performance	7	1556	February 14, 2019
CUDA 8.0.26 Inline PTX "addc" Bug CUDA Programming and Performance	18	1490	December 26, 2017
Bytes manipulation in PTX CUDA Programming and Performance hw , cuda , kernel	23	2982	March 2, 2023
How to use the `ex2.approx.f16x2` instruction? CUDA Programming and Performance	2	309	August 28, 2024

Generating XMAD{.X,.CC} by PTX

Related topics