long-integer multiplication: mul.wide.u64 and mul.wide.u128

Various use cases require integer data types wider than the widest types provided by the C/C++ standards, and thus by CUDA. Particularly crucial building blocks are “wide” multiplies that return the full 2n-bit product of two n-bit multiplicands.
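As a baseline, the widest wide multiply expressible in portable standard C++ is 32 x 32 -> 64, obtained by promoting one operand to a 64-bit type; a trivial host-side sketch (the function name is mine):

```cpp
#include <cstdint>

// full 64-bit product of two 32-bit operands: the widest "wide multiply"
// expressible in portable standard C++ without compiler extensions
uint64_t umul32wide (uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}
```

Anything wider than this requires either compiler extensions (such as unsigned __int128) or manual composition from narrower multiplies, which is exactly what the code below does.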

Starting with sm_20, it was trivial to compose such long-integer arithmetic primitives with the help of the PTX instructions mad{c}{.cc}, which mapped directly to the underlying IMAD hardware instructions. This hardware support disappeared with sm_50 and is unlikely to ever come back, because it requires too much silicon real estate that can be utilized more profitably for other functional units, e.g. those targeted at AI. The basic building block for integer multiplies on sm_50 and later is the XMAD instruction, which multiplies two 16-bit multiplicands into a 32-bit full product, to which another 32-bit operand is added.
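In C terms, the core XMAD operation can be modeled roughly as below; this sketch ignores the instruction's various operand-selection and shift modes, and is my reading of the instruction rather than an official definition:

```cpp
#include <cstdint>

// rough C model of the basic XMAD operation: a 16 x 16 -> 32 bit
// multiply plus a 32-bit addend (operand modes not modeled)
uint32_t xmad (uint16_t a, uint16_t b, uint32_t c)
{
    return (uint32_t)a * b + c;
}
```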

While PTXAS provides quite efficient emulation sequences for the PTX instructions mad{c}{.cc}, its common subexpression elimination is unable to remove many duplicate operations that occur when composing long-integer multiplies from these PTX instructions; this is probably a consequence of different modes being used on the individual XMAD instructions that make up most of these emulation sequences. The only way to create truly efficient long-integer multiplies on sm_50 and beyond at the PTX level is to compose them from mul.wide.u16 instructions, which CUDA 8’s PTXAS does a good job of transforming into XMAD instructions at the SASS level.

Below are sample implementations of a 64 x 64 -> 128 bit multiply and a 128 x 128 -> 256 bit multiply. For each, the classical version using IMAD for sm_2x and sm_3x is shown, as well as the XMAD-based version for sm_5x and sm_6x. As far as the XMAD-based code goes, these are first attempts of mine: improvements are likely possible, in particular for mul.wide.u128, which I lazily composed from four instances of mul.wide.u64. While the XMAD-based code compiles to many more instructions, it should achieve roughly the same throughput on sm_5x and sm_6x as the old IMAD-based code did on sm_2x and sm_3x, thanks to the high throughput of XMAD. Register pressure is increased by the switch from IMAD to XMAD; presumably NVIDIA figured this would be a reasonable trade-off given the relatively copious register resources of newer GPU architectures.

[code updated 7/12/2017]

/*
  Copyright (c) 2017, Norbert Juffa
  All rights reserved.

  Redistribution and use in source and binary forms, with or without 
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright 
     notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
  "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
  SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 
  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
// multiply two unsigned 64-bit integers into an unsigned 128-bit product
__device__ ulonglong2 umul64wide (uint64_t a, uint64_t b)
{
    ulonglong2 res;
#if (__CUDA_ARCH__ >= 200) && (__CUDA_ARCH__ < 500)
    asm ("{\n\t"
         ".reg .u32 r0, r1, r2, r3, alo, ahi, blo, bhi;\n\t"
         "mov.b64         {alo,ahi}, %2;\n\t"
         "mov.b64         {blo,bhi}, %3;\n\t"
         "mul.lo.u32      r0, alo, blo;\n\t"
         "mul.hi.u32      r1, alo, blo; \n\t"
         "mad.lo.cc.u32   r1, alo, bhi, r1;\n\t"
         "madc.hi.u32     r2, alo, bhi, 0;\n\t"
         "mad.lo.cc.u32   r1, ahi, blo, r1;\n\t"
         "madc.hi.cc.u32  r2, ahi, blo, r2;\n\t"
         "madc.hi.u32     r3, ahi, bhi, 0;\n\t"
         "mad.lo.cc.u32   r2, ahi, bhi, r2;\n\t"
         "addc.u32        r3, r3, 0;\n\t"
         "mov.b64         %0, {r0,r1};\n\t"  
         "mov.b64         %1, {r2,r3};\n\t"
         "}"
         : "=l"(res.x), "=l"(res.y)
         : "l"(a), "l"(b));
#elif __CUDA_ARCH__ >= 500
    asm ("{\n\t"
         ".reg .u32       alo, ahi, blo, bhi, r0, r1, r2, r3;\n\t"
         ".reg .u32       s0, s1, s2, s3, t0, t1, t2, t3;\n\t"
         ".reg .u16       a0, a1, a2, a3, b0, b1, b2, b3;\n\t"
         // split inputs into 16-bit chunks
         "mov.b64         {alo,ahi}, %2;\n\t"
         "mov.b64         {blo,bhi}, %3;\n\t"
         "mov.b32         {a0,a1}, alo;\n\t"
         "mov.b32         {a2,a3}, ahi;\n\t"
         "mov.b32         {b0,b1}, blo;\n\t"
         "mov.b32         {b2,b3}, bhi;\n\t"
         // first partial sum:
         // a3b3.wide  a1b3.wide  a0b2.wide  a0b0.wide
         //     0      a2b2.wide  a1b1.wide  
         //     0      a3b1.wide  a2b0.wide
         "mul.wide.u16    r0, a0, b0;\n\t"
         "mul.wide.u16    r1, a0, b2;\n\t"
         "mul.wide.u16    r2, a1, b3;\n\t"
         "mul.wide.u16    r3, a3, b3;\n\t"
         "mul.wide.u16    t1, a1, b1;\n\t"
         "mul.wide.u16    t2, a2, b2;\n\t"
         "add.cc.u32      r1, r1, t1;\n\t"
         "addc.cc.u32     r2, r2, t2;\n\t"
         "addc.u32        r3, r3, 0;\n\t"
         "mul.wide.u16    t1, a2, b0;\n\t"
         "mul.wide.u16    t2, a3, b1;\n\t"
         "add.cc.u32      r1, r1, t1;\n\t"
         "addc.cc.u32     r2, r2, t2;\n\t"
         "addc.u32        r3, r3, 0;\n\t"
         // second partial sum:
         // a2b3.wide  a0b3.wide  a0b1.wide
         // a3b2.wide  a1b2.wide  a1b0.wide 
         //     0      a2b1.wide
         //     0      a3b0.wide
         "mul.wide.u16    s0, a0, b1;\n\t"
         "mul.wide.u16    s1, a0, b3;\n\t"
         "mul.wide.u16    s2, a2, b3;\n\t"
         "mul.wide.u16    t1, a2, b1;\n\t"
         "add.cc.u32      s1, s1, t1;\n\t"
         "addc.u32        s2, s2, 0;\n\t"
         "mul.wide.u16    t1, a3, b0;\n\t"
         "add.cc.u32      s1, s1, t1;\n\t"
         "addc.u32        s2, s2, 0;\n\t"
         "mul.wide.u16    t0, a1, b0;\n\t"
         "mul.wide.u16    t1, a1, b2;\n\t"
         "mul.wide.u16    t2, a3, b2;\n\t"
         "add.cc.u32      s0, s0, t0;\n\t"
         "addc.cc.u32     s1, s1, t1;\n\t"
         "addc.cc.u32     s2, s2, t2;\n\t"
         "addc.u32        s3, 0, 0;\n\t"
         // offset second partial sum by 16 bits to the left
         "shf.l.clamp.b32 t3, s2, s3, 16;\n\t"
         "shf.l.clamp.b32 t2, s1, s2, 16;\n\t"
         "shf.l.clamp.b32 t1, s0, s1, 16;\n\t"
         "shf.l.clamp.b32 t0,  0, s0, 16;\n\t"
         // add first sum in r{0,1,2,3} to second sum in t{0,1,2,3}
         "add.cc.u32      r0, r0, t0;\n\t"
         "addc.cc.u32     r1, r1, t1;\n\t"
         "addc.cc.u32     r2, r2, t2;\n\t"
         "addc.u32        r3, r3, t3;\n\t"
         // pack outputs
         "mov.b64         %0, {r0,r1};\n\t"
         "mov.b64         %1, {r2,r3};\n\t"
         "}"
         : "=l"(res.x), "=l"(res.y)
         : "l"(a), "l"(b));
#elif __CUDA_ARCH__
#error unsupported __CUDA_ARCH__
#else // avoid warning
    res.x = 0;
    res.y = 0;
#endif
    return res;
}
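For validation on the host, the following C++ reference mirrors the 32-bit-limb schoolbook scheme of the sm_2x path above; the function name and the pair-based return convention (least significant half first) are mine, and it is meant for checking results, not for performance:

```cpp
#include <cstdint>
#include <utility>

// host-side reference for a 64 x 64 -> 128 bit multiply, composed from
// four 32 x 32 -> 64 bit partial products (schoolbook scheme)
std::pair<uint64_t, uint64_t> umul64wide_ref (uint64_t a, uint64_t b)
{
    uint64_t alo = (uint32_t)a, ahi = a >> 32;
    uint64_t blo = (uint32_t)b, bhi = b >> 32;
    uint64_t p0 = alo * blo;            // bits   0 ..  63
    uint64_t p1 = alo * bhi;            // bits  32 ..  95
    uint64_t p2 = ahi * blo;            // bits  32 ..  95
    uint64_t p3 = ahi * bhi;            // bits  64 .. 127
    // sum of the three contributions to bits 32..63, with carry-out
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;
    uint64_t lo  = (mid << 32) | (uint32_t)p0;
    uint64_t hi  = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return {lo, hi};   // {least significant half, most significant half}
}
```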

// multiply two unsigned 128-bit integers into an unsigned 256-bit product
__device__ ulonglong4 umul128wide (ulonglong2 a, ulonglong2 b)
{
    ulonglong4 res;
#if (__CUDA_ARCH__ >= 200) && (__CUDA_ARCH__ < 500)
    asm ("{\n\t"
         ".reg .u32 r0, r1, r2, r3, r4, r5, r6, r7;\n\t"
         ".reg .u32 a0, a1, a2, a3, b0, b1, b2, b3;\n\t"
         "mov.b64         {a0,a1}, %4;\n\t"
         "mov.b64         {a2,a3}, %5;\n\t"
         "mov.b64         {b0,b1}, %6;\n\t"
         "mov.b64         {b2,b3}, %7;\n\t"
         "mul.lo.u32      r0, a0, b0;\n\t"
         "mul.hi.u32      r1, a0, b0;\n\t"
         "mad.lo.cc.u32   r1, a0, b1, r1;\n\t"
         "madc.hi.u32     r2, a0, b1, 0;\n\t"
         "mad.lo.cc.u32   r1, a1, b0, r1;\n\t"
         "madc.hi.cc.u32  r2, a1, b0, r2;\n\t"
         "madc.hi.u32     r3, a0, b2, 0;\n\t"
         "mad.lo.cc.u32   r2, a0, b2, r2;\n\t"
         "madc.hi.cc.u32  r3, a1, b1, r3;\n\t"
         "madc.hi.u32     r4, a0, b3, 0;\n\t"
         "mad.lo.cc.u32   r2, a1, b1, r2;\n\t"
         "madc.hi.cc.u32  r3, a2, b0, r3;\n\t"
         "madc.hi.cc.u32  r4, a1, b2, r4;\n\t"
         "madc.hi.u32     r5, a1, b3, 0;\n\t"
         "mad.lo.cc.u32   r2, a2, b0, r2;\n\t"
         "madc.lo.cc.u32  r3, a0, b3, r3;\n\t"
         "madc.hi.cc.u32  r4, a2, b1, r4;\n\t"
         "madc.hi.cc.u32  r5, a2, b2, r5;\n\t"
         "madc.hi.u32     r6, a2, b3, 0;\n\t"
         "mad.lo.cc.u32   r3, a1, b2, r3;\n\t"
         "madc.hi.cc.u32  r4, a3, b0, r4;\n\t"
         "madc.hi.cc.u32  r5, a3, b1, r5;\n\t"
         "madc.hi.cc.u32  r6, a3, b2, r6;\n\t"
         "madc.hi.u32     r7, a3, b3, 0;\n\t"
         "mad.lo.cc.u32   r3, a2, b1, r3;\n\t"
         "madc.lo.cc.u32  r4, a1, b3, r4;\n\t"
         "madc.lo.cc.u32  r5, a2, b3, r5;\n\t"
         "madc.lo.cc.u32  r6, a3, b3, r6;\n\t"
         "addc.u32        r7, r7, 0;\n\t"
         "mad.lo.cc.u32   r3, a3, b0, r3;\n\t"
         "madc.lo.cc.u32  r4, a2, b2, r4;\n\t"
         "madc.lo.cc.u32  r5, a3, b2, r5;\n\t"
         "addc.cc.u32     r6, r6, 0;\n\t"
         "addc.u32        r7, r7, 0;\n\t"
         "mad.lo.cc.u32   r4, a3, b1, r4;\n\t"
         "addc.cc.u32     r5, r5, 0;\n\t"
         "addc.cc.u32     r6, r6, 0;\n\t"
         "addc.u32        r7, r7, 0;\n\t"
         "mov.b64         %0, {r0,r1};\n\t"  
         "mov.b64         %1, {r2,r3};\n\t"
         "mov.b64         %2, {r4,r5};\n\t"  
         "mov.b64         %3, {r6,r7};\n\t"
         "}"
         : "=l"(res.x), "=l"(res.y), "=l"(res.z), "=l"(res.w)
         : "l"(a.x), "l"(a.y), "l"(b.x), "l"(b.y));
#elif __CUDA_ARCH__ >= 500
    asm ("{\n\t"
         ".reg .u32 aa0, aa1, aa2, aa3, bb0, bb1, bb2, bb3;\n\t"
         ".reg .u32 r0, r1, r2, r3, r4, r5, r6, r7;\n\t"
         ".reg .u32 s0, s1, s2, s3, s4, s5, s6, s7;\n\t"
         ".reg .u32 t0, t1, t2, t3, t4, t5, t6, t7;\n\t"
         ".reg .u16 a0, a1, a2, a3, a4, a5, a6, a7;\n\t"
         ".reg .u16 b0, b1, b2, b3, b4, b5, b6, b7;\n\t"
         // unpack source operands
         "mov.b64         {aa0,aa1}, %4;\n\t"
         "mov.b64         {aa2,aa3}, %5;\n\t"
         "mov.b64         {bb0,bb1}, %6;\n\t"
         "mov.b64         {bb2,bb3}, %7;\n\t"
         "mov.b32         {a0,a1}, aa0;\n\t"
         "mov.b32         {a2,a3}, aa1;\n\t"
         "mov.b32         {a4,a5}, aa2;\n\t"
         "mov.b32         {a6,a7}, aa3;\n\t"
         "mov.b32         {b0,b1}, bb0;\n\t"
         "mov.b32         {b2,b3}, bb1;\n\t"
         "mov.b32         {b4,b5}, bb2;\n\t"
         "mov.b32         {b6,b7}, bb3;\n\t"
         // compute first partial sum
         "mul.wide.u16    r0, a0, b0;\n\t"
         "mul.wide.u16    r1, a0, b2;\n\t"
         "mul.wide.u16    r2, a0, b4;\n\t"
         "mul.wide.u16    r3, a0, b6;\n\t"
         "mul.wide.u16    r4, a1, b7;\n\t"
         "mul.wide.u16    r5, a3, b7;\n\t"
         "mul.wide.u16    r6, a5, b7;\n\t"
         "mul.wide.u16    r7, a7, b7;\n\t"
         "mul.wide.u16    t3, a1, b5;\n\t"
         "mul.wide.u16    t4, a2, b6;\n\t"
         "add.cc.u32      r3, r3, t3;\n\t"
         "addc.cc.u32     r4, r4, t4;\n\t"
         "addc.u32        r5, r5, 0;\n\t"
         "mul.wide.u16    t3, a2, b4;\n\t"
         "mul.wide.u16    t4, a3, b5;\n\t"
         "add.cc.u32      r3, r3, t3;\n\t"
         "addc.cc.u32     r4, r4, t4;\n\t"
         "addc.u32        r5, r5, 0;\n\t"
         "mul.wide.u16    t2, a1, b3;\n\t"
         "mul.wide.u16    t3, a3, b3;\n\t"
         "mul.wide.u16    t4, a4, b4;\n\t"
         "mul.wide.u16    t5, a4, b6;\n\t"
         "add.cc.u32      r2, r2, t2;\n\t"
         "addc.cc.u32     r3, r3, t3;\n\t"
         "addc.cc.u32     r4, r4, t4;\n\t"
         "addc.cc.u32     r5, r5, t5;\n\t"
         "addc.u32        r6, r6, 0;\n\t"
         "mul.wide.u16    t2, a2, b2;\n\t"
         "mul.wide.u16    t3, a4, b2;\n\t"
         "mul.wide.u16    t4, a5, b3;\n\t"
         "mul.wide.u16    t5, a5, b5;\n\t"
         "add.cc.u32      r2, r2, t2;\n\t"
         "addc.cc.u32     r3, r3, t3;\n\t"
         "addc.cc.u32     r4, r4, t4;\n\t"
         "addc.cc.u32     r5, r5, t5;\n\t"
         "addc.u32        r6, r6, 0;\n\t"
         "mul.wide.u16    t1, a1, b1;\n\t"
         "mul.wide.u16    t2, a3, b1;\n\t"
         "mul.wide.u16    t3, a5, b1;\n\t"
         "mul.wide.u16    t4, a6, b2;\n\t"
         "mul.wide.u16    t5, a6, b4;\n\t"
         "mul.wide.u16    t6, a6, b6;\n\t"
         "add.cc.u32      r1, r1, t1;\n\t"
         "addc.cc.u32     r2, r2, t2;\n\t"
         "addc.cc.u32     r3, r3, t3;\n\t"
         "addc.cc.u32     r4, r4, t4;\n\t"
         "addc.cc.u32     r5, r5, t5;\n\t"
         "addc.cc.u32     r6, r6, t6;\n\t"
         "addc.u32        r7, r7, 0;\n\t"
         "mul.wide.u16    t1, a2, b0;\n\t"
         "mul.wide.u16    t2, a4, b0;\n\t"
         "mul.wide.u16    t3, a6, b0;\n\t"
         "mul.wide.u16    t4, a7, b1;\n\t"
         "mul.wide.u16    t5, a7, b3;\n\t"
         "mul.wide.u16    t6, a7, b5;\n\t"
         "add.cc.u32      r1, r1, t1;\n\t"
         "addc.cc.u32     r2, r2, t2;\n\t"
         "addc.cc.u32     r3, r3, t3;\n\t"
         "addc.cc.u32     r4, r4, t4;\n\t"
         "addc.cc.u32     r5, r5, t5;\n\t"
         "addc.cc.u32     r6, r6, t6;\n\t"
         "addc.u32        r7, r7, 0;\n\t"
         // compute second partial sum
         "mul.wide.u16    t0, a0, b1;\n\t"
         "mul.wide.u16    t1, a0, b3;\n\t"
         "mul.wide.u16    t2, a0, b5;\n\t"
         "mul.wide.u16    t3, a0, b7;\n\t"
         "mul.wide.u16    t4, a2, b7;\n\t"
         "mul.wide.u16    t5, a4, b7;\n\t"
         "mul.wide.u16    t6, a6, b7;\n\t"
         "mul.wide.u16    s3, a1, b6;\n\t"
         "add.cc.u32      t3, t3, s3;\n\t"
         "addc.u32        t4, t4, 0;\n\t"
         "mul.wide.u16    s3, a2, b5;\n\t"
         "add.cc.u32      t3, t3, s3;\n\t"
         "addc.u32        t4, t4, 0;\n\t"
         "mul.wide.u16    s2, a1, b4;\n\t"
         "mul.wide.u16    s3, a3, b4;\n\t"
         "mul.wide.u16    s4, a3, b6;\n\t"
         "add.cc.u32      t2, t2, s2;\n\t"
         "addc.cc.u32     t3, t3, s3;\n\t"
         "addc.cc.u32     t4, t4, s4;\n\t"
         "addc.u32        t5, t5, 0;\n\t"
         "mul.wide.u16    s2, a2, b3;\n\t"
         "mul.wide.u16    s3, a4, b3;\n\t"
         "mul.wide.u16    s4, a4, b5;\n\t"
         "add.cc.u32      t2, t2, s2;\n\t"
         "addc.cc.u32     t3, t3, s3;\n\t"
         "addc.cc.u32     t4, t4, s4;\n\t"
         "addc.u32        t5, t5, 0;\n\t"
         "mul.wide.u16    s1, a1, b2;\n\t"
         "mul.wide.u16    s2, a3, b2;\n\t"
         "mul.wide.u16    s3, a5, b2;\n\t"
         "mul.wide.u16    s4, a5, b4;\n\t"
         "mul.wide.u16    s5, a5, b6;\n\t"
         "add.cc.u32      t1, t1, s1;\n\t"
         "addc.cc.u32     t2, t2, s2;\n\t"
         "addc.cc.u32     t3, t3, s3;\n\t"
         "addc.cc.u32     t4, t4, s4;\n\t"
         "addc.cc.u32     t5, t5, s5;\n\t"
         "addc.u32        t6, t6, 0;\n\t"
         "mul.wide.u16    s1, a2, b1;\n\t"
         "mul.wide.u16    s2, a4, b1;\n\t"
         "mul.wide.u16    s3, a6, b1;\n\t"
         "mul.wide.u16    s4, a6, b3;\n\t"
         "mul.wide.u16    s5, a6, b5;\n\t"
         "add.cc.u32      t1, t1, s1;\n\t"
         "addc.cc.u32     t2, t2, s2;\n\t"
         "addc.cc.u32     t3, t3, s3;\n\t"
         "addc.cc.u32     t4, t4, s4;\n\t"
         "addc.cc.u32     t5, t5, s5;\n\t"
         "addc.u32        t6, t6, 0;\n\t"
         "mul.wide.u16    s0, a1, b0;\n\t"
         "mul.wide.u16    s1, a3, b0;\n\t"
         "mul.wide.u16    s2, a5, b0;\n\t"
         "mul.wide.u16    s3, a7, b0;\n\t"
         "mul.wide.u16    s4, a7, b2;\n\t"
         "mul.wide.u16    s5, a7, b4;\n\t"
         "mul.wide.u16    s6, a7, b6;\n\t"
         "add.cc.u32      t0, t0, s0;\n\t"
         "addc.cc.u32     t1, t1, s1;\n\t"
         "addc.cc.u32     t2, t2, s2;\n\t"
         "addc.cc.u32     t3, t3, s3;\n\t"
         "addc.cc.u32     t4, t4, s4;\n\t"
         "addc.cc.u32     t5, t5, s5;\n\t"
         "addc.cc.u32     t6, t6, s6;\n\t"
         "addc.u32        t7, 0, 0;\n\t"
         // offset second partial sum by 16 bits
         "shf.l.clamp.b32 s7, t6, t7, 16;\n\t"
         "shf.l.clamp.b32 s6, t5, t6, 16;\n\t"
         "shf.l.clamp.b32 s5, t4, t5, 16;\n\t"
         "shf.l.clamp.b32 s4, t3, t4, 16;\n\t"
         "shf.l.clamp.b32 s3, t2, t3, 16;\n\t"
         "shf.l.clamp.b32 s2, t1, t2, 16;\n\t"
         "shf.l.clamp.b32 s1, t0, t1, 16;\n\t"
         "shf.l.clamp.b32 s0,  0, t0, 16;\n\t"
         // add partial sums
         "add.cc.u32      r0, r0, s0;\n\t"
         "addc.cc.u32     r1, r1, s1;\n\t"
         "addc.cc.u32     r2, r2, s2;\n\t"
         "addc.cc.u32     r3, r3, s3;\n\t"
         "addc.cc.u32     r4, r4, s4;\n\t"
         "addc.cc.u32     r5, r5, s5;\n\t"
         "addc.cc.u32     r6, r6, s6;\n\t"
         "addc.u32        r7, r7, s7;\n\t"
         // pack up result
         "mov.b64         %0, {r0,r1};\n\t"  
         "mov.b64         %1, {r2,r3};\n\t"
         "mov.b64         %2, {r4,r5};\n\t"  
         "mov.b64         %3, {r6,r7};\n\t"
         "}"
         : "=l"(res.x), "=l"(res.y), "=l"(res.z), "=l"(res.w)
         : "l"(a.x), "l"(a.y), "l"(b.x), "l"(b.y));
#elif __CUDA_ARCH__
#error unsupported __CUDA_ARCH__
#else // avoid warning   
    res.x = 0;
    res.y = 0;
    res.z = 0;
    res.w = 0;
#endif
    return res;
}
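A matching host-side reference for the 128-bit case, composed from four 64 x 64 -> 128 bit partial products; it leans on the GCC/Clang extension type unsigned __int128, so it is a validation aid rather than portable code, and the limb-array return convention (least significant limb first) is mine:

```cpp
#include <cstdint>
#include <array>

// host-side reference for a 128 x 128 -> 256 bit multiply, composed
// from four 64 x 64 -> 128 bit partial products
std::array<uint64_t, 4> umul128wide_ref (uint64_t alo, uint64_t ahi,
                                         uint64_t blo, uint64_t bhi)
{
    typedef unsigned __int128 u128;
    u128 p0 = (u128)alo * blo;   // contributes to limbs 0, 1
    u128 p1 = (u128)alo * bhi;   // contributes to limbs 1, 2
    u128 p2 = (u128)ahi * blo;   // contributes to limbs 1, 2
    u128 p3 = (u128)ahi * bhi;   // contributes to limbs 2, 3
    std::array<uint64_t, 4> r;
    r[0] = (uint64_t)p0;
    u128 acc = (p0 >> 64) + (uint64_t)p1 + (uint64_t)p2;
    r[1] = (uint64_t)acc;
    acc = (acc >> 64) + (p1 >> 64) + (p2 >> 64) + (uint64_t)p3;
    r[2] = (uint64_t)acc;
    r[3] = (uint64_t)((acc >> 64) + (p3 >> 64));
    return r;
}
```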

So if I compile PTXAS with the directive:

.target sm_20

and I use either:

mad.wide.u32

or:

mul.wide.u32

instructions, the performance will suffer on high-end GPU cards (those capable of compiling at sm_50 or above)?

Thanks for the info, BTW. It’s as helpful as it is disturbing. And it’s quite disturbing…

It is not a question of high-end or low-end GPUs. It is a question of different GPU architectures. The intermediate emulation introduces overhead, which is not very surprising. Not sure why one would consider it “disturbing”. Using native hardware instructions should be expected to deliver the best results.

You can experiment with the effect of IMAD emulation by removing the #if guards in the above code and compiling the first code variant (intended for sm_2x and sm_3x) for sm_5x or sm_6x. Here is simple inline PTX code for a mul.wide.u64 via mul.wide.u32, if you want to give that a try:

    asm ("{\n\t"
         ".reg .u32 p0l, p0h, p1l, p1h, p2l, p2h, p3l, p3h, r0, r1, r2, r3, alo, ahi, blo, bhi;\n\t"
         ".reg .u64 p0, p1, p2, p3;\n\t"
         "mov.b64         {alo,ahi}, %2;\n\t"
         "mov.b64         {blo,bhi}, %3;\n\t"
         "mul.wide.u32    p0, alo, blo;\n\t"
         "mul.wide.u32    p1, alo, bhi;\n\t"
         "mul.wide.u32    p2, ahi, blo;\n\t"
         "mul.wide.u32    p3, ahi, bhi;\n\t"
         "mov.b64         {p0l,p0h}, p0;\n\t"
         "mov.b64         {p1l,p1h}, p1;\n\t"
         "mov.b64         {p2l,p2h}, p2;\n\t"
         "mov.b64         {p3l,p3h}, p3;\n\t"
         "mov.b32         r0, p0l;\n\t"
         "add.cc.u32      r1, p0h, p1l;\n\t"
         "addc.cc.u32     r2, p1h, p2h;\n\t"
         "addc.u32        r3, p3h, 0;\n\t"
         "add.cc.u32      r1, r1, p2l;\n\t"
         "addc.cc.u32     r2, r2, p3l;\n\t"
         "addc.u32        r3, r3, 0;\n\t"
         "mov.b64         %0, {r0,r1};\n\t"  
         "mov.b64         %1, {r2,r3};\n\t"
         "}"
         : "=l"(res.x), "=l"(res.y)
         : "l"(a), "l"(b));

It’s disturbing to me, and hopefully everybody else, that more recent architectures will produce more inefficiencies and redundant instructions in the compiled machine code than the earlier architectures, from the same PTXAS source code. That is what you’re saying right? It should be the exact opposite situation.

As for using “native hardware instructions”, I don’t quite get the context here. Do you mean programming on a level below PTXAS? Not everyone has the luxury of knowing what GPU hardware their software is going to be run on, so that is not an option for many (including myself). If you mean PTXAS itself, then yes, it should be expected to deliver the best results. But from what you’re saying, those same “results” (i.e. the efficiency of the compiled code) is not ubiquitous across architectures, which I find “disturbing”, because my understanding is that PTXAS was supposed to be designed for that very purpose.

Anyway, my question actually pertained to whether or not what you were talking about would manifest for CUDA PTXAS kernel code that employed either the mad.wide.u32 or mul.wide.u32 instructions, as opposed to the mad.wide.u64 and/or mul.wide.u64 instructions. Apparently it does not, because you’ve proceeded to employ the mul.wide.u32 instruction itself in the ‘fix’ for what I assume is the problem you’re reporting.

If so, that’s a relief to me, because I just released software that uses it (both mad.wide.u32 & mul.wide.u32) quite extensively, and it is computationally sensitive to even the smallest of inefficiencies…

Insights?

PTX is a virtual architecture (and a compiler intermediate format). It is needed because the actual ISA of each generation of NVIDIA’s GPUs is not binary compatible with the previous one, like people might be used to from the x86 world.

Any analysis into efficiency of generated code has to look at the actual hardware instructions which you can see by disassembling a binary with cuobjdump --dump-sass, where SASS is the machine language provided by the hardware. If you do that, you will find that many PTX instructions are implemented as sequences of SASS instructions, some short, some lengthy. Analyzing GPU code at the PTX level is largely meaningless. NVIDIA does not publish details of the SASS instruction set, however a summary listing of the available instructions with one sentence description is part of the available CUDA documentation.

The translation of PTX to SASS is the job of PTXAS, which is the last of the CUDA compiler stages that are orchestrated by the compiler driver nvcc. Contrary to what the name may suggest, PTXAS is an optimizing compiler. Even so, it cannot work miracles when stitching together multiple emulation sequences such as those involved in the twenty or so PTX flavors of 32-bit IMUL/IMAD (of which about eight are needed to implement an efficient long-integer multiply on sm_2x and sm_3x, as you can tell from the code above).

GPUs are attractive in high performance computing because they offer high performance per device (and, related, performance per square millimeter of silicon), high performance per watt, and high performance per dollar. It is the first of these metrics that causes NVIDIA to re-evaluate the silicon budget with every new architecture, and this can lead to the removal of instructions from hardware, as happened in the switch from IMAD to XMAD. Multipliers are “large” structures as computational units go; the silicon saved by dropping dedicated IMAD units could be devoted to other performance features, while XMAD re-uses the hardware of the single-precision floating-point multipliers.

This plays into the second metric: If you look at the Green500 results, you will notice a very sizeable increase in energy efficiency going from Kepler to Maxwell, and again from Maxwell to Pascal. Obviously this wasn’t all just due to micro-architectural changes, but those certainly play a significant role.

With full adaptation of code, the overall throughput of most flavors of individual integer multiplies was preserved in the IMAD/XMAD switch, especially for the common cases. However, code that uses multiple different flavors of complicated IMAD versions, such as integer divisions or the long-integer multiply code discussed here, takes a bit of a performance hit unless it is re-written in terms of mul.wide.u16, at which point parity to previous platforms is roughly restored.

You can find tradeoffs of similar nature between computational units in all major processor families, including the dozen or so architecture generations of x86 processors since the PentiumPro of 1995. Unlike on GPUs, instruction emulation code in x86 processors resides inside the CPU, as microcode, since binary compatibility must be preserved.

Depending on what exactly this computation does, it may well benefit from being rewritten in terms of mul.wide.u16 for sm_50 and later processors. You might want to tackle a sample routine, re-write it and see what the profiler has to say. If you want to get a rough idea for possible differences in instruction count and register pressure, you may want to simply build the three variants of mul.wide.u64 that I posted above, compile them for sm_50, then compare them with cuobjdump --dump-sass.

Yes, that’s so much clearer to me now. They dumped direct hardware support for integer multiplies in the later architectures, in favor of re-purposing the single-precision FP multipliers instead, which can only handle 16-bit integer multiplies, which consequently leads to a performance hit because more instructions, as well as more instruction modes, are generated, which, uh, confuses(?) the latent stages of the optimization process, to the point where it may fail, in parts, to recognize, and therefore eliminate, all instances of redundant computation…

I think.

I probably got a bit of that wrong, but again, it’s a whole lot clearer to me now, regardless.

Still think it should be reported as a bug though. I yam what I yam, and easy to please I yam not.

I’m so glad you included that paragraph. My cases are definitely what you would call “common”, so I’m not going to worry about it any longer. As always, thanks for the info and your patience.

Nah, I don’t think so. If I was going to go that far, I probably would just end up taking it all the way and getting rid of the multiply instruction entirely, by implementing a full-fledged 8-bit Digital Quarter Square Multiplier.

If I were working on code that is “sensitive to even the smallest of inefficiencies”, I would look into not just two, but several alternative ways of coding the computation. For example, for the mul.wide.u64 implementation, I looked into five other PTX code variants for sm_5x, before I decided it was time to utilize mul.wide.u16, which led to code with the smallest number of instructions and the least register pressure.

The thing to keep in mind is that PTXAS is a compiler, not an assembler, so on the first try the SASS popping out at the end is rarely exactly what an expert programmer had in mind.

FWIW, the quarter-square multiplication algorithm was already obsolete by the time I started x86 assembly language programming in 1982 (on a Sirius Victor; or was it the Victor Sirius?). Despite the horribly slow MUL/IMUL instructions on the 8086, it was still the best building block for long-integer multiplies. My memory is hazy, but I seem to recall going via MUL beat the best alternatives by about 20-30%.

But you’re comparing apples to oranges by assuming that x86 performance translates to SASS performance. Case in point: a while back I implemented the Karatsuba algorithm on both the x86 (in asm) and PTX ISA. Using the grade-school algorithm as a baseline, the x86 achieved the following speedups, by dword (32-bit) block size:

16: 160.9% of grade school algorithm’s execution speed (slower)…
32: 157.1% …(slower)…
64: 124.5% …(slower)…
128: 95.1% …(faster)…
256: 73.6% …(faster)…
512: 51.0% …(faster)…
1024: 30.3% …(faster)…
2048: 26.9% …(faster)…
4096: 23.4% …(faster)…
8192: 17.4% …(faster)…
16384: 12.5% …(faster)…
32768: 9.2% …(faster - as in much)…

When I implemented the exact same algorithm in PTX ISA, translating the x86 code as closely as I could into PTX ISA, no speed up at all was achieved compared to the PTX ISA implementation of the grade-school algorithm, regardless of chosen block size.

And yes, I’m fully aware that PTX ISA is compiled, whereas asm isn’t. But that’s kind of the point.

In fact, the Karatsuba implementation in PTX ISA would consistently test out slower (but, to be fair, not by much) than the grade-school implementation.

So your (apparently unequivocal) statement declaring that the quarter-square multiplication algorithm is “obsolete” for all computational environments, based solely on (your) experience with the x86, is, I believe, not entirely accurate, because the differences in algorithmic performance between the two computational environments are, as I have tried to illustrate, prodigious enough to cast serious doubt on such a blanket condemnation of that, or any particular, algorithm.

Just saying…

The problem with Karatsuba on the GPU is divergence, or if you avoid divergence, redundant computation. I should say that I have only tried relatively short integers so far, but my experiments were as unsuccessful as yours.
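For concreteness, here is what a single Karatsuba step looks like for a 64 x 64 -> 128 bit product on the host, trading four 32-bit multiplies for three; the extra additions and the wider (33-bit-input) middle term hint at why the savings evaporate for short operands. This sketch uses the GCC/Clang extension type unsigned __int128 to hold the middle term, and the naming is mine:

```cpp
#include <cstdint>
#include <utility>

// one Karatsuba step: 64 x 64 -> 128 bit product from three (not four)
// 32 x 32 -> 64 bit multiplies
std::pair<uint64_t, uint64_t> karatsuba64 (uint64_t a, uint64_t b)
{
    typedef unsigned __int128 u128;
    uint64_t al = (uint32_t)a, ah = a >> 32;
    uint64_t bl = (uint32_t)b, bh = b >> 32;
    u128 z0 = (u128)(al * bl);                       // low partial product
    u128 z2 = (u128)(ah * bh);                       // high partial product
    u128 z1 = (u128)(al + ah) * (bl + bh) - z0 - z2; // middle term
    u128 p  = (z2 << 64) + (z1 << 32) + z0;
    return {(uint64_t)p, (uint64_t)(p >> 64)};       // {low half, high half}
}
```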

I didn’t say or mean to imply that the quarter-square multiplication approach is obsolete for all conceivable processor platforms. But I would certainly think it is obsolete for any processors with multiplication capability provided by the hardware and high-performance computing platforms like the GPU in particular.

If you work with very low-end $1 microcontrollers similar in capability to the 6502 or Z80 CPU of old, the quarter-square multiplication algorithm might well be useful today. The last time I dealt with fairly low-end processors I used ~200 MHz ARM processors with very early NVIDIA mobile products ca. 2003, and even those had a zippy (5 cycle?) 32x32->64 bit multiply instruction built in.
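For the curious, the quarter-square scheme rests on the identity a*b = floor((a+b)^2/4) - floor((|a-b|)^2/4), which holds exactly because a+b and a-b always have the same parity; a minimal 8 x 8 -> 16 bit table-based sketch (naming is mine):

```cpp
#include <cstdint>
#include <cstdlib>

// quarter-square multiply: a 511-entry table of floor(n*n/4) replaces
// the multiplier for an 8 x 8 -> 16 bit product
static uint16_t qsq_tab[511];

void qsq_init (void)
{
    for (uint32_t n = 0; n < 511; n++) qsq_tab[n] = (uint16_t)(n * n / 4);
}

uint16_t qsq_mul (uint8_t a, uint8_t b)
{
    return (uint16_t)(qsq_tab[(uint32_t)a + b] - qsq_tab[abs ((int)a - (int)b)]);
}
```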

That’s a really tempting invite, but for now I’ll have to give that a pass, only because I’m really busy right now with other, non-software-related (aka: mundane) things… After things settle down though, I think I’d rather enjoy seeing those numbers… I have no idea when that’ll be though…

Anyway, I’ve given some additional thought to NVIDIA’s decision to trash the hardware support for integer multiplication (along the lines that you’ve outlined), and I find myself disagreeing with that decision wholeheartedly. Integer multiplication for values wider than 16 bits is essential for both efficient address calculations and unique thread identification (which are related), both of which have to be employed in abundance in any CUDA thread, especially those that use 64-bit addressing and/or those that aim to be as non-divergent as possible. And the fact that you can’t reference on-board memory as a register-indexed immediate operand (like on the x86 and other processor architectures) within the confines of the PTX ISA instruction set (and probably SASS too) only serves to compound the problem.

Which is to say, integer multiplication, and the ability to handle it as efficiently as possible (i.e. in hardware), I would think is as important, if not more important, than any minor uptick in silicon real-estate that might be garnered from its removal.

And this is the part where I’m supposed to say, “but that’s just me”. But actually, I’m hoping that it isn’t just me.

As I said, better implementations of the two multiplication primitives I posted likely do exist, and it would be to the benefit of the general public if they could be posted to this thread (attaching a non-restrictive standard open-source license will allow others to use such code as they see fit).

Keep in mind that NVIDIA is a business, not a research organization. GPU architectures evolve to benefit the largest number of purchasing customers so the company’s revenues and profits can be maximized. Like any company its size, NVIDIA has a number of technical and non-technical people tasked with engaging with customers to understand what features the market requires. NVIDIA’s economic success is a pretty good indication that the company does understand market requirements well.

Realistically, it’s impossible for any company to make all customers happy all the time (it is a laudable goal, though). When I first heard about the removal of IMAD from the next generation of GPUs, it was clear to me that this would negatively affect certain specialized codes such as the long-integer computations discussed here. However, from what I can see, the market for such computations isn’t very large. In addition, any performance lost in the transition can be regained by rewriting the computations to make the best use of XMAD possible, from what I can tell so far. So there does not seem to be any lasting harm.

Best I can tell, common cases of integer multiplies, such as those used in addressing arithmetic, have not been negatively affected by the IMAD/XMAD switch. In fact, a simple 32-bit multiply now takes three XMAD instructions that collectively execute at higher throughput (same rate as simple integer arithmetic) than the old IMUL, which executed at 1/4 the throughput of simple integer arithmetic. It should also be noted that many multiplications in addressing arithmetic tend to disappear once the compiler applies strength reduction (a standard optimization).
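To make the three-partial-product structure concrete, here is a rough host-side C model (my sketch only; the real XMAD sequence fuses the shift and adds into instruction modes such as PSL, so this captures the structure, not the exact instruction count of every step):

```c
#include <stdint.h>

/* Host-side model of composing a 32-bit multiply (low 32 bits of the
   product) from three 16x16->32 bit multiplies, the shape of work the
   three-XMAD emulation performs on sm_5x/sm_6x. */
static uint32_t mul32_via_16bit(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xffff, ah = a >> 16;   /* split into 16-bit halves */
    uint32_t bl = b & 0xffff, bh = b >> 16;

    uint32_t lo   = al * bl;   /* partial product, no shift            */
    uint32_t mid1 = al * bh;   /* partial product, belongs at bit 16   */
    uint32_t mid2 = ah * bl;   /* partial product, belongs at bit 16   */

    /* ah*bh lands entirely in the upper 32 bits, so it is not needed
       for the low half of the product. */
    return lo + ((mid1 + mid2) << 16);
}
```

In hardware, the `<< 16` is absorbed by XMAD’s product-shift-left mode rather than issued as a separate shift.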

I can only speculate, but I think it is highly unlikely that 32-bit IMAD will be restored in future GPU generations. The positive cost/benefit effect just does not seem to be there to justify the hardware expenditure. This is different from the reduction in SFU (special function unit) throughput that NVIDIA tried for one GPU generation early on in the life of CUDA, where it turned out that enough use cases (and customers) were impacted negatively that NVIDIA promptly restored the “customary” SFU throughput in the next GPU generation.

If this refers to a [register + immediate offset] addressing mode, that is something GPUs support last time I checked. Random snippet of disassembled code:

/*0198*/                   LDG.E.64 R28, [R6+0x18];
/*01a8*/                   LDG.E.64 R16, [R6+0x10];
/*01b0*/                   LDG.E.64 R14, [R6+0x28];
/*01b8*/                   LDG.E.64 R12, [R6+0x20];

Interesting to read that Nvidia is reusing the single precision multipliers for 16 bit integer multiply. I wasn’t quite sure whether that was the case or not.
While reducing the exposed width to a power of 2 certainly tidied up the interface compared to CC 1.x, I still wonder though whether at some point Nvidia is going to expose the full 24 bits again. With the growing importance of long address calculations, a 64x64 bit multiplication could be achieved with 9 instead of 16 multiplication instructions. For something whose biggest cost would probably be the opcode space it takes up, that seems like an attractive proposition.
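To make the 9-versus-16 comparison concrete, here is a host-side C sketch assuming a hypothetical 24x24->48 bit multiply primitive (the gcc/clang `__int128` extension is used only to assemble and return the result): three 24-bit limbs per 64-bit operand yield 3 x 3 = 9 partial products, versus 4 x 4 = 16 with 16-bit limbs.

```c
#include <stdint.h>

/* Sketch: 64x64 -> 128 bit multiply from 24-bit limbs. */
static void mul64wide_24(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t al[3], bl[3], col[5] = {0, 0, 0, 0, 0};
    for (int i = 0; i < 3; i++) {
        al[i] = (a >> (24 * i)) & 0xffffff;  /* 24-bit limbs (top limb holds 16 bits) */
        bl[i] = (b >> (24 * i)) & 0xffffff;
    }
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            col[i + j] += al[i] * bl[j];     /* the 9 partial products; each < 2^48,
                                                at most 3 per column, so no overflow */
    unsigned __int128 res = 0;
    for (int k = 4; k >= 0; k--)             /* sum columns at 24-bit offsets */
        res = (res << 24) + col[k];
    *lo = (uint64_t)res;
    *hi = (uint64_t)(res >> 64);
}
```

On real hardware the column sums would of course be done with carry chains rather than a 128-bit accumulator; the point here is just the partial-product count.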

I do not actually know this for a fact, so I should have added appropriate weasel words. Consider them added now :-). I concluded this hardware sharing from the following facts: (1) high throughput of XMAD, (2) three-input instruction, (3) FFMA path forms a 48-bit product so a 32-bit product “fits”.

If you still have a CUDA toolchain around that can build code for sm_1x, you might want to look at the emulation sequence for 32-bit integer multiplies. I am fairly certain it was composited from 16-bit operations, but my memory is hazy. The required shifting to add partial products is cumbersome with 24-bit operands; this is where XMAD’s PSL mode (“product shift left”) comes in. The design of XMAD is clever and elegant in that it supports a lot of features with likely minimal hardware cost.

That’s really good to know. Thanks for that.

Actually, I was referring to the following restrictions in PTX ISA:

.version 2.3
.target sm_20
.address_size 64

.global.u32 Srce[ 4096 ];
.global.u32 Dest[ 4096 ];

.entry Foobar
{
    .reg.u64 GUTO;  // <== "Globally Unique Thread Ordinal"..

    .reg.u64 Addr;
    .reg.u32 BlkID, ThrdID, ThrdspBlk, Valu;

    mov.u32 BlkID, %ctaid.x;
    mov.u32 ThrdspBlk, %ntid.x;
    cvt.u64.u32 GUTO, %tid.x;
    mad.wide.u32 GUTO, ThrdspBlk, BlkID, GUTO;

    shl.b64 GUTO, GUTO, 2;    // <== ..byte index into Srce[]..

//  ld.global.u32 Valu, Srce[GUTO];     // ..Nope; no can do. compile error..
//  ld.global.u32 Valu, [GUTO+Srce];    // ..Never tried it this way, but doubt it would work..
//  ld.global.u32 Valu, [Srce+GUTO];    // ..ditto this..

    cvta.global.u64 Addr, Srce;     // this, and
    add.u64 Addr, Addr, GUTO;       // this, and finally
    ld.global.u32 Valu, [Addr];     // this, will compile..

//  ..more code..

    ret.uni;
}

So perhaps not a restriction in the actual SASS - that was just a guess…

If you mean “Do GPUs have an addressing mode with a built-in scale factor like the SIB mode on x86 (which can scale by 1x, 2x, 4x, 8x)?”, the answer is that no such addressing mode exists on the GPU (to the best of my knowledge; I haven’t checked Pascal). To first order, the GPU’s native ISA is best described as a minimalist RISC ISA.

The problems of dealing with a 64-bit address space on what is basically a 32-bit machine have been around ever since GPUs supported 64-bit platforms (because in CUDA, device data types match host data types), i.e. a long time ago. The biggest problem that arises from 64-bit addressing used to be register pressure, not the overhead of index calculations involving multiplies. With the Maxwell architecture and beyond, the register pressure issues have pretty much disappeared.

And from observation of generated code, the compiler seems to strength-reduce index computations less aggressively than it did five years ago, leaving more multiplies in place. I have to assume this is intentional and another indication that integer multiplies for address computation are not a significant performance issue.

I still wonder though whether at some point Nvidia is going to expose the full 24 bits again

Volta separated FP32 and INT32 ALUs

although, BTW, my wild guess is that SM 7.1 will have an FP32 engine plus a combined FP32/INT32 engine in each dispatcher. That would allow doubling the FP32 marketing numbers without adding much silicon. The FP32 engine could then implement a few simpler INT operations, making the overall architecture similar to a CPU with one full ALU plus one simple ALU

I have updated the code path for >= sm_50 for the 128 x 128 -> 256 bit multiplication in my original post. I tightened the code as much as possible by assembling the result directly from the 64 partial 32-bit products produced by mul.wide.u16. This saved around twenty instructions.
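For anyone wanting to experiment with the limb structure on the host first, here is a plain C model of the 16-bit-limb schoolbook scheme (my sketch, not the actual PTX): 8 limbs per 128-bit operand give the 64 `mul.wide.u16`-shaped partial products mentioned above, accumulated per column and then carry-propagated.

```c
#include <stdint.h>

/* Sketch: 128x128 -> 256 bit multiply over 16-bit limbs, a[0]/b[0]
   being the least significant limb, result in r[0..15]. */
static void mul128wide_16(const uint16_t a[8], const uint16_t b[8],
                          uint16_t r[16])
{
    uint64_t col[16] = {0};
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 8; j++)
            col[i + j] += (uint32_t)a[i] * b[j];  /* 64 partial products;
                                                     at most 8 per column,
                                                     so col[] cannot overflow */
    uint64_t carry = 0;
    for (int k = 0; k < 16; k++) {                /* propagate carries */
        uint64_t t = col[k] + carry;
        r[k] = (uint16_t)t;
        carry = t >> 16;
    }
}
```

On the GPU the column sums become chains of XMADs with carry handling instead of 64-bit accumulators, but the partial-product count is the same.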

I wonder if there’s any efficiency gain to be had from applying the Karatsuba multiplication rule instead of the usual schoolbook approach. Essentially it’s trading multiplications for additions/subtractions (which work on the full 32-bit register width, and hence could save a few instructions).
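A host-side C sketch of that trade for the 64 x 64 -> 128 bit case (three 32 x 32 -> 64 multiplies instead of the schoolbook four; the extra carry bookkeeping is what may eat the savings on the GPU) could look like this, using the subtractive Karatsuba variant to keep the cross term inside 64 bits:

```c
#include <stdint.h>

/* Sketch: 64x64 -> 128 bit multiply via Karatsuba on 32-bit halves.
   Product = z2*2^64 + z1*2^32 + z0, with z1 = z0 + z2 - (al-ah)*(bl-bh). */
static void mul64wide_karatsuba(uint64_t a, uint64_t b,
                                uint64_t *hi, uint64_t *lo)
{
    uint64_t al = (uint32_t)a, ah = a >> 32;
    uint64_t bl = (uint32_t)b, bh = b >> 32;

    uint64_t z0 = al * bl;                       /* multiply #1 */
    uint64_t z2 = ah * bh;                       /* multiply #2 */

    uint64_t da = al >= ah ? al - ah : ah - al;  /* |al - ah| */
    uint64_t db = bl >= bh ? bl - bh : bh - bl;  /* |bl - bh| */
    int neg = (al >= ah) != (bl >= bh);          /* sign of (al-ah)*(bl-bh) */
    uint64_t dd = da * db;                       /* multiply #3 */

    /* z1 = al*bh + ah*bl can reach 65 bits; track bit 64 in z1c. */
    uint64_t z1  = z0 + z2;
    uint64_t z1c = z1 < z0;                      /* carry out of z0 + z2 */
    if (neg) { z1c += (z1 + dd) < z1; z1 += dd; }
    else     { z1c -= z1 < dd;        z1 -= dd; }

    /* Assemble the 128-bit result. */
    uint64_t mid_lo = z1 << 32;
    *lo = z0 + mid_lo;
    *hi = z2 + (z1 >> 32) + (z1c << 32) + (*lo < mid_lo);
}
```

Not tuned for (or tested on) the GPU, just an illustration of where the add/sub overhead goes.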

As I mentioned earlier in this thread, I have not had any success trying to use Karatsuba to speed up the code for these small sizes on the GPU. I’d love to see your code if you have managed to do that.

Ah, apologies for not reading your previous postings very thoroughly.

At this time I feel more inclined to use your code as a low-level building block for a Karatsuba-based multiplication of somewhat longer integers (say 1024 bits). At a later time I may go back to the low-level PTX code to see if I can find any further speedups.