long-integer multiplication: mul.wide.u64 and mul.wide.u128

Hi Norbert,

any reason why you calculate the “second partial sum” (which is offset by 16 bit) after the “first partial sum”? When you calculate the “second partial sum” first, shift it by 16 bit (maybe even inplace but that doesn’t matter here) and than go straigth for the “first partial sum”, you might be able to save some registers, depending on the size of the operants it might mattter.

Did a first try in my code and rewriting a single function (squaring an 80bit int) gave me a small improvement already.

Oliver

I did not do any extreme optimizations here, I was mostly looking for a few hours of fun diversion in tackling this problem (as retired folks are wont to do :-).

I did try to arrange the code so the compiler would produce a minimal number of instructions for the case at hand, but I did not look out for register pressure at all. I am sure with more effort one could optimize more. Keep in mind that you are tackling a different problem and strategies will differ for every use case (e.g. I use a somewhat different approach for an n x n → n multiplication, for which I even wrote a primitive code generator).

Consider posting your code when you are done, it might help the next person (especially it it is under a permissive license like BSD or MIT).

OK, good to know that I didn’t overlook something. You’re right we are looking at different problems/usecases. My code will become public available once it is done - the whole project is available under GPL v3. Might include some “special” cases - I’m thinking about 79bit squaring with an included multiply by two (159bit result). So it is more about sharing ideas.

Oliver

Best of luck in your efforts. You are obviously free to pick any license you consider suitable. Due to the viral nature of GPL licenses, I personally give GPL-licensed code a wide berth.

What’s the SASS your code procudes at your card? For me, (sm52, compute52, CUDA 9.1) there’s no difference (the code is 100% identical down to variable names) between your 64x64 code and this much simpler code:

asm volatile (
		"mad.lo.cc.u32   %0,  %6,  %4,  0;\n\t"
		"madc.lo.cc.u32  %1,  %6,  %5,  0;\n\t"
		"addc.u32        %2,   0,   0;\n\t"

		"mad.hi.cc.u32   %1,  %6,  %4, %1;\n\t"
		"madc.hi.u32     %2,  %6,  %5, %2;\n\t"

		"mad.lo.cc.u32   %1,  %7,  %4, %1;\n\t"
		"madc.lo.cc.u32  %2,  %7,  %5, %2;\n\t"
		"addc.u32        %3,   0,  0;\n\t"

		"mad.hi.cc.u32   %2,  %7,  %4, %2;\n\t"
		"madc.hi.u32     %3,  %7,  %5, %3;\n\t"
		:
		"=r"(c.n[0]), "=r"(c.n[1]), "=r"(c.n[2]), "=r"(c.n[3])
		:
		"r"(a.n[0]), "r"(a.n[1]),
		"r"(b.n[0]), "r"(b.n[1]));

This is the SASS generated on my system:

Fatbin elf code:
================
arch = sm_52
code version = [1,7]
producer = cuda
host = windows
compile_size = 64bit

        code for sm_52
                Function : _Z10testKernelP9uint256_tS0_
        .headerflags    @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
                                                                                                /* 0x001fd800fec007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20];                                        /* 0x4c98078000870001 */
        /*0010*/                   MOV R6, c[0x0][0x140];                                       /* 0x4c98078005070006 */
        /*0018*/                   IADD32I R6.CC, R6, 0x4;                                      /* 0x1c10000000470606 */
                                                                                                /* 0x0001c4000e2007f2 */
        /*0028*/                   IADD.X R7, RZ, c[0x0][0x144];                                /* 0x4c1008000517ff07 */
        /*0030*/                   LDG.E R2, [R6+-0x4];                                         /* 0xeed42fffffc70602 */
        /*0038*/                   LDG.E R9, [R6+0x1c];                                         /* 0xeed4200001c70609 */
                                                                                                /* 0x011f840016400091 */
        /*0048*/                   LDG.E R0, [R6];                                              /* 0xeed4200000070600 */
        /*0050*/                   LDG.E R5, [R6+0x20];                                         /* 0xeed4200002070605 */
        /*0058*/                   XMAD R3, R9, R2, RZ;                                         /* 0x5b007f8000270903 */
                                                                                                /* 0x081fd0d0fe2607f1 */
        /*0068*/                   XMAD.MRG R4, R9.reuse, R2.H1.reuse, RZ;                      /* 0x5b007fa800270904 */
        /*0070*/                   XMAD R10, R9.reuse, R0.reuse, RZ;                            /* 0x5b007f800007090a */
        /*0078*/                   XMAD.MRG R8, R9.reuse, R0.H1, RZ;                            /* 0x5b007fa800070908 */
                                                                                                /* 0x001f8440fe2a07f1 */
        /*0088*/                   XMAD.PSL.CBCC R4, R9.H1.reuse, R4.H1, R3.reuse;              /* 0x5b30019800470904 */
        /*0090*/                   XMAD.CHI R13, R9.H1.reuse, R2, R3;                           /* 0x5b2801800027090d */
        /*0098*/                   XMAD.PSL.CBCC R8, R9.H1, R8.H1, R10;                         /* 0x5b30051800870908 */
                                                                                                /* 0x001fc440fe4607f1 */
        /*00a8*/                   XMAD R11, R9.reuse, R2.H1.reuse, RZ;                         /* 0x5b007f880027090b */
        /*00b0*/                   XMAD R12, R9.H1.reuse, R2.H1, RZ;                            /* 0x5b207f880027090c */
        /*00b8*/                   IADD R3.CC, RZ, R4;                                          /* 0x5c1080000047ff03 */
                                                                                                /* 0x001f84c0fe260ff1 */
        /*00c8*/                   XMAD R6, R9.reuse, R0.H1.reuse, RZ;                          /* 0x5b007f8800070906 */
        /*00d0*/                   XMAD.CHI R7, R9.H1.reuse, R0.reuse, R10;                     /* 0x5b28050000070907 */
        /*00d8*/                   XMAD R9, R9.H1, R0.H1, RZ;                                   /* 0x5b207f8800070909 */
                                                                                                /* 0x1c1fc400fda007f2 */
        /*00e8*/                   IADD3.RS R11, R13, R11, R12;                                 /* 0x5cc0062000b70d0b */
        /*00f0*/                   IADD.X R4.CC, RZ, R8;                                        /* 0x5c1088000087ff04 */
        /*00f8*/                   XMAD R8, R5.reuse, R2.reuse, RZ;                             /* 0x5b007f8000270508 */
                                                                                                /* 0x001fc400fe2007f1 */
        /*0108*/                   XMAD.MRG R10, R5, R2.H1, RZ;                                 /* 0x5b007fa80027050a */
        /*0110*/                   IADD3.RS R7, R7, R6, R9;                                     /* 0x5cc004a000670707 */
        /*0118*/                   IADD.X R6, RZ, RZ;                                           /* 0x5c1008000ff7ff06 */
                                                                                                /* 0x181fc540fe2007e3 */
        /*0128*/                   IADD R4.CC, R4, R11;                                         /* 0x5c10800000b70404 */
        /*0130*/                   XMAD.PSL.CBCC R10, R5.H1.reuse, R10.H1, R8.reuse;            /* 0x5b30041800a7050a */
        /*0138*/                   XMAD R11, R5.reuse, R0.reuse, RZ;                            /* 0x5b007f800007050b */
                                                                                                /* 0x001fc800fe2207f1 */
        /*0148*/                   XMAD.MRG R12, R5.reuse, R0.H1, RZ;                           /* 0x5b007fa80007050c */
        /*0150*/                   XMAD.CHI R9, R5.H1, R2, R8;                                  /* 0x5b28040000270509 */
        /*0158*/                   IADD.X R6, R6, R7;                                           /* 0x5c10080000770606 */
                                                                                                /* 0x281fc4c0fe2007e1 */
        /*0168*/                   IADD R7.CC, R4, R10;                                         /* 0x5c10800000a70407 */
        /*0170*/                   XMAD R4, R5.reuse, R2.H1.reuse, RZ;                          /* 0x5b007f8800270504 */
        /*0178*/                   XMAD.PSL.CBCC R12, R5.H1.reuse, R12.H1, R11.reuse;           /* 0x5b30059800c7050c */
                                                                                                /* 0x181fc4c0fe2207f1 */
        /*0188*/                   XMAD R2, R5.H1.reuse, R2.H1, RZ;                             /* 0x5b207f8800270502 */
        /*0190*/                   XMAD R8, R5.reuse, R0.H1.reuse, RZ;                          /* 0x5b007f8800070508 */
        /*0198*/                   XMAD R10, R5.H1.reuse, R0.H1.reuse, RZ;                      /* 0x5b207f880007050a */
                                                                                                /* 0x001fcc00fe2007e2 */
        /*01a8*/                   XMAD.CHI R5, R5.H1, R0, R11;                                 /* 0x5b28058000070505 */
        /*01b0*/                   IADD.X R6.CC, R6, R12;                                       /* 0x5c10880000c70606 */
        /*01b8*/                   IADD3.RS R4, R9, R4, R2;                                     /* 0x5cc0012000470904 */
                                                                                                /* 0x001fc400fe2007f2 */
        /*01c8*/                   IADD3.RS R2, R5, R8, R10;                                    /* 0x5cc0052000870502 */
        /*01d0*/                   IADD.X R0, RZ, RZ;                                           /* 0x5c1008000ff7ff00 */
        /*01d8*/                   IADD R6.CC, R6, R4;                                          /* 0x5c10800000470606 */
                                                                                                /* 0x0003c800fe4007e1 */
        /*01e8*/                   MOV R4, c[0x0][0x148];                                       /* 0x4c98078005270004 */
        /*01f0*/                   MOV R5, c[0x0][0x14c];                                       /* 0x4c98078005370005 */
        /*01f8*/                   STG.E [R4], R3;                                              /* 0xeedc200000070403 */
                                                                                                /* 0x0003c4001e2007f0 */
        /*0208*/         {         IADD.X R2, R0, R2;                                           /* 0x5c10080000270002 */
        /*0210*/                   STG.E [R4+0x4], R7;        }                                 /* 0xeedc200000470407 */
        /*0218*/                   STG.E [R4+0x8], R6;                                          /* 0xeedc200000870406 */
                                                                                                /* 0x001ffc00ffe000f5 */
        /*0228*/                   STG.E [R4+0xc], R2;                                          /* 0xeedc200000c70402 */
        /*0230*/                   EXIT;                                                        /* 0xe30000000007000f */
        /*0238*/                   BRA 0x238;                                                   /* 0xe2400fffff87000f */
                .......................................

I don’t know what you are comparing when you say “identical”. While functionally equivalent, your source code is not literally identical to the 64x64 MAD{C}-based code I gave in my original post:

asm ("{\n\t"
         ".reg .u32 r0, r1, r2, r3, alo, ahi, blo, bhi;\n\t"
         "mov.b64         {alo,ahi}, %2;\n\t"
         "mov.b64         {blo,bhi}, %3;\n\t"
         "mul.lo.u32      r0, alo, blo;\n\t"
         "mul.hi.u32      r1, alo, blo; \n\t"
         "mad.lo.cc.u32   r1, alo, bhi, r1;\n\t"
         "madc.hi.u32     r2, alo, bhi, 0;\n\t"
         "mad.lo.cc.u32   r1, ahi, blo, r1;\n\t"
         "madc.hi.cc.u32  r2, ahi, blo, r2;\n\t"
         "madc.hi.u32     r3, ahi, bhi, 0;\n\t"
         "mad.lo.cc.u32   r2, ahi, bhi, r2;\n\t"
         "addc.u32        r3, r3, 0;\n\t"
         "mov.b64         %0, {r0,r1};\n\t"  
         "mov.b64         %1, {r2,r3};\n\t"
         "}"
         : "=l"(res.x), "=l"(res.y)
         : "l"(a), "l"(b));

I used CUDA 8 (I don’t recall which revision) with an sm_50 target for my experiments. You are using CUDA 9.1. It seems entirely possible that compiler improvements were made within the past year. It is also possible that the compiler’s code generation is “brittle” and that equivalent, but somewhat different, arrangements of 32-bit MAD{C}-based 64x64->128 bit multiplication result in different XMAD sequences in the SASS.

For various reasons I am not likely to switch to CUDA 9.x anytime soon, meaning I will not be repeating my previous experiments with that version of the compiler. If your code gives optimal performance with CUDA 9.x that’s great.

Actually, it is the CUDA_ARCH >= 500 variant that gets compiled to the same SASS. I was just wondering if performance was potentially lost in 9.1 compared to 8 - seems strange that it just ignores all your hand crafted optimizations and somehow arrives at a “generic” 64x64bit mult.

So you are saying my source code based on 16-bit multiplies now results in the exact same SASS sequence as your source code based on 32-bit MAD? If so, that would be a Good Thing™, meaning CUDA programmers can now write code in a more natural and concise style, while getting SASS with the same performance. Seems like an improvement to me.

I don’t understand why you think performance may have been lost in the transition from CUDA 8 to CUDA 9.1 here.

(This was already mentioned in this thread…) I’m pretty sure Volta (cc7.0) has a hardware 32-bit integer multiply instruction now:

[url]Programming Guide :: CUDA Toolkit Documentation

that runs at the same rate as FP32. Therefore, if the SASS codes are equivalent in earlier architectures, the 32-bit implementation may be better on 7.0 if it can use that instruction.

Yes, I saw that being mentioned. Is it just a multiply or a multiply-add in V100? Will that capability carry forward to the consumer variants of the Volta architecture?

Let me just say that NVIDIA isn’t doing programmers any favors by changing the hardware ISA around in this fashion every couple of generations.

I haven’t explored/tested myself, but the programming guide specifically calls out multiply-add. There is a footnote which cuts the throughput in half for “extended precision”, but still probably a win over emulation.

The only currently available consumer variant is Titan V. AFAIK Titan V is compute capability 7.0. I would fully expect specs and general programming guide info applicable to cc7.0 to be applicable to Titan V.

There aren’t any announced cc7.x architectures beyond cc7.0 at this time that I’m aware of.

I read a lot of rumors that there will not be any Volta-based consumer cards, and that the next generation of consumer cards will instead be based on a Pascal-derived architecture. If true, XMAD might be with us mortals a bit longer after all :-)