32-bit number multiplication

Hi all,

Based on some tests, my implementation with PTX is slower than normal code…

But I would like to speed it up even more.

So I am asking whether there is a faster technique than the following to multiply very long numbers (e.g. 1024 bits) based on 32-bit limbs:

for (j = 0; j < 8; j++) {

    a = l_y[7 - j];

    for (i = 0; i < 8; i++) {

        b = l_x[7 - i];
        t = s_x[15 - i - j + offset];
        low = t + c;
        c = (UINT_MAX - c) < t;
        c += (UINT_MAX - low) < (a*b);
        low += a*b;
        c += __umulhi(a, b);
        s_x[15 - i - j + offset] = low;
    }

    s_x[15 - i - j + offset] += c;
}

Imagine the result as [c, low].

l_y and l_x are the factors, s_x the result

t is the partial result in s_x to be added to the next partial multiplication

I think I could save something since I am computing a*b two or three times… am I right?

I thought I could store a*b in a 64-bit integer and then shift it to split it across two 32-bit unsigned integers… what do you think?

For 1024 bits you could do it with 3 FFTs of 128 entries each, but I don't know if it's faster. Or look up Karatsuba on the internet.

I am actually using Karatsuba; what you see there is just a (sub)addition inside the process.

The carry computation in HLL typically results in fairly inefficient code. With PTX inline assembly you should be able to do better, as there is direct access to the carry. I have only coded up to a 64x64->128 bit multiplication using PTX, and that still used the paper & pencil algorithm. As I recall that compiled to 15 instructions or so. If there is interest in the code, I can dig it up and post it here. We are looking into ways of improving PTX support to allow for more efficient multi-word multiplies; from a hardware perspective the 64x64->128 bit multiplication should only require 9 instructions (or thereabouts) on sm_2x.

I have not yet established at what point use of Karatsuba starts making sense. Based on my experience on CPUs, as numbers get longer, one would first want to switch from longhand multiplication to Karatsuba, then to Toom-Cook, and finally to FFT. The FFT approach typically requires multi-thousand-bit operands before it is more efficient than Toom-Cook. It would be an interesting exercise to determine the proper switch-over points for a Fermi-class GPU; unfortunately I have not found the time yet. I am also not aware of published papers on the topic.


I found my 64x64->128 bit multiplication code using inline PTX. 14 PTX instructions that compile to 12 machine instructions on sm_2x. I would consider this a useful building block for even wider multiplies on Fermi-class GPUs.

__device__ __forceinline__ uint4 my_mul_64_128 (uint2 a, uint2 b)
{
    uint4 res;
    asm ("{\n\t"
         ".reg .u32 t1;\n\t"
         "mul.lo.u32      %0, %4, %6;\n\t"  // (a.lo * b.lo).lo
         "mul.hi.u32      %1, %4, %6;\n\t"  // (a.lo * b.lo).hi
         "mul.lo.u32      %2, %5, %7;\n\t"  // (a.hi * b.hi).lo
         "mul.hi.u32      %3, %5, %7;\n\t"  // (a.hi * b.hi).hi
         "mul.lo.u32      t1, %4, %7;\n\t"  // (a.lo * b.hi).lo
         "add.cc.u32      %1, %1, t1;\n\t"  // add (a.lo * b.hi).lo to result
         "mul.hi.u32      t1, %4, %7;\n\t"  // (a.lo * b.hi).hi
         "addc.cc.u32     %2, %2, t1;\n\t"  // add (a.lo * b.hi).hi to result
         "addc.u32        %3, %3, 0 ;\n\t"  // propagate carry to highest word
         "mul.lo.u32      t1, %5, %6;\n\t"  // (a.hi * b.lo).lo
         "add.cc.u32      %1, %1, t1;\n\t"  // add (a.hi * b.lo).lo to result
         "mul.hi.u32      t1, %5, %6;\n\t"  // (a.hi * b.lo).hi
         "addc.cc.u32     %2, %2, t1;\n\t"  // add (a.hi * b.lo).hi to result
         "addc.u32        %3, %3, 0 ;\n\t"  // propagate carry to highest word
         "}"
         : "=r"(res.x), "=r"(res.y), "=r"(res.z), "=r"(res.w)
         : "r"(a.x), "r"(a.y), "r"(b.x), "r"(b.y));
    return res;
}

Thanks njuffa for your contribution…

However, just a question: what about using a double-width register instead of performing two muls (.lo and .hi)? Isn't this more convenient?

How many inputs/outputs are truly possible for PTX in 2.x?

I ported the above from some 32-bit SPARC code I wrote twenty years ago :-) Keep in mind that if you use 64-bit integer operations in PTX, you are actually invoking emulation code sequences as there is no native hardware support for 64-bit integers (other than for conversion from and to floating-point). It is an open question whether a combination of the various snippets of emulation code would be more efficient than directly coding in 32-bit operations that map to hardware almost one-to-one. Personally, I prefer to stick closely to what the hardware offers for code like this.

I am not sure what you are asking. Is your question how many arguments can be bound to registers by an asm() statement? If so, I do not know the answer to that question as I am not aware of a specific upper limit. Fermi supports up to 63 32-bit registers per thread, that may be a bigger limiter in practical terms than possible restrictions on the number of registers that can be bound to an asm() statement. Please note that an asm() statement binds virtual PTX registers, not physical registers, and there is a vast supply of virtual registers.

With CUDA 4.1 (RC1 became available to registered developers today) it is possible to reduce the 64x64->128 bit multiplication down to just nine instructions, provided one targets compute capability 2.0 or higher. You might want to give this a try.

__device__ __forceinline__ uint4 new_mul_64_128 (uint2 a, uint2 b)
{
    uint4 res;
    asm ("{\n\t"
         "mul.lo.u32      %0, %4, %6;\n\t"
         "mul.hi.u32      %1, %4, %6;\n\t"
         "mad.lo.cc.u32   %1, %5, %6, %1;\n\t"
         "madc.hi.u32     %2, %5, %6, 0;\n\t"
         "mad.lo.cc.u32   %1, %4, %7, %1;\n\t"
         "madc.hi.cc.u32  %2, %4, %7, %2;\n\t"
         "addc.u32        %3, 0, 0;\n\t"
         "mad.lo.cc.u32   %2, %5, %7, %2;\n\t"
         "madc.hi.u32     %3, %5, %7, %3;\n\t"
         "}"
         : "=r"(res.x), "=r"(res.y), "=r"(res.z), "=r"(res.w)
         : "r"(a.x), "r"(a.y), "r"(b.x), "r"(b.y));
    return res;
}


The line that read “addc.u32 %3, %3, 0;” must instead be “addc.u32 %3, 0, 0;” as %3 has not been initialized. I have applied that correction above. Sorry for any confusion posting the incorrect version may have caused.

Cool, any idea whether there is a performance gain for cc 2.0/2.1 devices? Or does this just save a register and produce less code "only"?

preface: this code is off the top of my head, hopefully I didn't screw up the lo- and hi-word order…

__device__ __forceinline__ uint4 new2_mul_64_128 (uint2 a, uint2 b)
{
    uint4 res;
    asm ("{\n\t"
         ".reg .u64 t1, t2, t3, t4;\n\t"
         "mov.b64         t1, {%4, %5};\n\t"
         "mov.b64         t2, {%6, %7};\n\t"
         "mul.lo.u64      t3, t1, t2;\n\t"
         "mul.hi.u64      t4, t1, t2;\n\t"
         "mov.b64         {%0, %1}, t3;\n\t"
         "mov.b64         {%2, %3}, t4;\n\t"
         "}"
         : "=r"(res.x), "=r"(res.y), "=r"(res.z), "=r"(res.w)
         : "r"(a.x), "r"(a.y), "r"(b.x), "r"(b.y));
    return res;
}

I’ve tested this once (actually I tested 96x96 -> 192-bit multiplications); it seems to be slower than your first approach!


If code is instruction throughput bound, the new MADC exposed in PTX will help performance (it has been there in the sm_2x hardware all along, but there was no good way to get at it). As I mentioned before, 64-bit integer operations in PTX map to emulation sequences at machine code level, so a combination of those emulation sequences is unlikely to be as efficient as code that basically maps to hardware instructions one-to-one.

Hi njuffa,

I’ll try it when I get CUDA 4.1. I need to wait until it is publicly available because I’m not a “registered developer”.
My code is not memory-bandwidth bound, but I’m not sure whether it is compute bound or instruction-throughput bound.


Why not sign up as a registered developer now, so you can try right away? I signed up as a registered developer myself a while ago (mostly so I can see the exact same content and environment external developers see); there is one online form to fill out, then a wait of a few days for approval, and you are good to go. Just make sure you don’t leave out any of the important information on the form. If you run into problems, feel free to send me a personal message via the forum.

Hi njuffa,

I’ve just requested a login. I’m curious whether my hobby project qualifies for a registered developer account or not.
Anyway, it is just a matter of time until I receive CUDA 4.1. I’ll post results when I have them.

– edit –
I’ve just received my login, much sooner than the stated “up to 5 days”. :)


Hi njuffa,

I like CUDA 4.1. For my code it is faster than all previous releases; not much faster, but faster.

mad[c].[lo|hi][.cc].u32 yielded an additional 1-2% performance for my code (a quick replacement of some mul.[lo|hi].u32 / add[c][.cc].u32 instructions). I guess I can improve this further with more time.


In case you (or others) are curious what my code does: trial factoring of Mersenne numbers for the GIMPS project: http://www.mersenne.org

Source and executables of the current released version (without the new mad instructions!): http://www.mersenneforum.org/mfaktc/

Discussion on mersenneforum: http://www.mersenneforum.org/showthread.php?t=12827

Additional information: http://www.mersennewiki.org/index.php/Mfaktc

Good to hear you are having a positive experience with CUDA 4.1 RC1.

Is your current code using 32-bit limbs as you mentioned in your original code? The main reason I brought up ways of performing efficient 64x64->128 bit multiplication on the GPU is because I suspect that use of 64-bit limbs may lead to higher performance even though the hardware has no support for 64-bit integer arithmetic.

Hi njuffa,

yes, 32-bit limbs for 96/192-bit integer arithmetic. Squaring a number mod F, with a different F for each thread.


What is the possible range for F? Are you currently implementing this as a brute-force 96x96->192 bit squaring followed by a separate modulo operation with F, or are the two operations somehow cleverly combined? Having worked on modular long integer arithmetic a long time ago I find this to be an interesting test case for the improved support for long integer arithmetic implemented in sm_2x.

Hi njuffa,

the upper limit for F in my code is 2^95, but practically it is more like 2^70 to 2^80 for current runs. Depending on the Mersenne number to factor there are ~20-30 repetitions for each F.
Squaring followed by mod; depending on the kernel, the mod is done with classical division (not that bad!) or Barrett's modular reduction. Montgomery reduction doesn't seem feasible for such a small number of repetitions with the same F.

You’re right, sm_2x is so much faster than sm_1x, GTX 580 beats GTX 285 by more than factor 5.

  • fast mul32 (sm_2x) rocks!
  • using all 32 bits per limb and add-with-carry avoids lots of shifts for carry handling
    (there is one kernel which uses 24 bits per limb (3/6 limbs - 72/144 bits), which is the fastest for sm_1x)


Good to hear sm_2x is working so well for your application. Looking at the 96x96->192 bit multiplication problem, it seems like it could be doable in 21 instructions, but I need to test the code thoroughly first before I post. It is all too easy to introduce bugs (as demonstrated by the correction I had to make to the 64x64->128 bit multiplication code above; a bug in the multiplication code was masked by a bug in my test app). Since the IMADs are cheap on sm_2x, it may not be worthwhile to provide a special squaring routine once the multiply is working.


OK, the 96x96->192 bit multiplication code seems to be working (unless I have a bug in my test app). The 192/96->96 bit division for the reduction is a somewhat harder problem. CUDA uses Newton-Raphson for efficient 32-bit and 64-bit integer division and modulo, but this case is a bit different since there is a double-wide dividend. Long-hand division with radix 2**32 using the Pope-Stein algorithm could be more appropriate.

__device__ __forceinline__ void mul_96_192 (const unsigned int *a,
                                            const unsigned int *b,
                                            unsigned int *p)
{
    unsigned int a0 = a[0];
    unsigned int a1 = a[1];
    unsigned int a2 = a[2];
    unsigned int b0 = b[0];
    unsigned int b1 = b[1];
    unsigned int b2 = b[2];
    unsigned int p0, p1, p2, p3, p4, p5;
    asm ("{\n\t"
         "mul.lo.u32      %0, %6,  %9    ;\n\t"  /* (a0 * b0)_lo */
         "mul.hi.u32      %1, %6,  %9    ;\n\t"  /* (a0 * b0)_hi */
         "mad.lo.cc.u32   %1, %6, %10, %1;\n\t"  /* (a0 * b1)_lo */
         "madc.hi.u32     %2, %6, %10,  0;\n\t"  /* (a0 * b1)_hi */
         "mad.lo.cc.u32   %1, %7,  %9, %1;\n\t"  /* (a1 * b0)_lo */
         "madc.hi.cc.u32  %2, %7,  %9, %2;\n\t"  /* (a1 * b0)_hi */
         "madc.hi.u32     %3, %6, %11,  0;\n\t"  /* (a0 * b2)_hi */
         "mad.lo.cc.u32   %2, %6, %11, %2;\n\t"  /* (a0 * b2)_lo */
         "madc.hi.cc.u32  %3, %7, %10, %3;\n\t"  /* (a1 * b1)_hi */
         "madc.hi.u32     %4, %7, %11,  0;\n\t"  /* (a1 * b2)_hi */
         "mad.lo.cc.u32   %2, %7, %10, %2;\n\t"  /* (a1 * b1)_lo */
         "madc.hi.cc.u32  %3, %8,  %9, %3;\n\t"  /* (a2 * b0)_hi */
         "madc.hi.cc.u32  %4, %8, %10, %4;\n\t"  /* (a2 * b1)_hi */
         "madc.hi.u32     %5, %8, %11,  0;\n\t"  /* (a2 * b2)_hi */
         "mad.lo.cc.u32   %2, %8,  %9, %2;\n\t"  /* (a2 * b0)_lo */
         "madc.lo.cc.u32  %3, %7, %11, %3;\n\t"  /* (a1 * b2)_lo */
         "madc.lo.cc.u32  %4, %8, %11, %4;\n\t"  /* (a2 * b2)_lo */
         "addc.u32        %5, %5,   0    ;\n\t"  /* propagate carry */
         "mad.lo.cc.u32   %3, %8, %10, %3;\n\t"  /* (a2 * b1)_lo */
         "addc.cc.u32     %4, %4,   0    ;\n\t"  /* propagate carry */
         "addc.u32        %5, %5,   0    ;\n\t"  /* propagate carry */
         "}"
         : "=r"(p0), "=r"(p1), "=r"(p2), "=r"(p3), "=r"(p4), "=r"(p5)
         : "r"(a0), "r"(a1), "r"(a2), "r"(b0), "r"(b1), "r"(b2));
    p[0] = p0;
    p[1] = p1;
    p[2] = p2;
    p[3] = p3;
    p[4] = p4;
    p[5] = p5;
}


Hello njuffa,

your mul_96_192() looks good to me.

Well, this depends on some other factors. If you want to calculate RES = A*A and you know that A is always < 2^95 (MSB never set), then squaring can be done faster. First calculate (a2*a2).lo and (a2*a2).hi. Next, shift a2 left by one bit. (a0*a2).[lo|hi] and (a1*a2).[lo|hi] are needed twice for squaring, so you can replace 4 mads with one shift-left.

If A is even smaller (e.g. < 2^94) you can reorder those mads to save some carry stages near the end. I'm talking about these adds:

"addc.u32        %5, %5,   0;    \n\t"  /* propagate carry */

If A is less than 2^80 it is safe to ignore the most significant word of the result.

FC = factor candidate

This doesn’t save much time but I’m down to ~85 clock cycles for a single “squaring mod FC” in my fastest kernel which handles FCs up to 2^79.

GTX 470: 1215 MHz * 448 cores / (~325M FCs tested per second * ~20 squarings mod F per FC) ≈ 84 cycles


With knowledge of the operand ranges it is certainly possible to apply further optimizations like the ones you outlined. Not having any detailed knowledge of the factoring I simply wrote a general purpose multiply.

If you are down to 85 cycles per squaring mod F, I assume this is not using a division for the modulo operation but Barrett’s reduction? I eyeballed the 192/96->96 bit longhand division using Pope-Stein and it seems that would be on the order of 250 cycles, but it’s hard to know without working out the details. I was thinking of building a 64/32->32 bit division on top of the 32/32 bit division, plus a 96x32->128 bit multiplication. Each quotient digit base 2**32 would use one of those plus one to three 128-bit adds/subtracts, and one would need three of the quotient digit steps to get the 96-bit modulo and 96-bit remainder.