32-bit number multiplication

Yes, that’s why I’ve written that you mul_96_192() looks good to me.

Correct. Those numbers are for my barrett79 kernel which handles FCs up to 2^79.

Main loop:

  • square_96_160() /* 96 bit input (must be < 2^79), 160bit output */

  • mul_96_192_no_low3() /* better name would be mul_96hi(), caculate upper 96 bits of a 96x96 multiplication with some loss of accuracy (no carrys from lower half) */

  • mul_96()

  • sub_96()

  • a single limb division as adjustment because plain barrett returns N = AB mod M where N < 3M!

There are two kernels which do classical division and mul32.[lo|hi].u32. The two kernels are for FCs up to 2^75 and 2^95. The 2^75 kernel is just a stripped down version of the 2^95 kernel. The 2^95 kernel is good for 45-50% performance compared to the barrett79 kernel on my GTX 470. So the classical division (192/96 → 96) is not that bad. I’m using a floatingpoint approximation for the quotient, 20 bits each step.

Oliver

P.S. some more PTX coding (more mad with carry) now yields 335M/s on my GTX 470.

Hi guys,

I update this thread because I am going into a new Cuda project, a modified version of the RNS Montgomery Exponentiation

I am going to execute A LOT operations like: multiplication between 32 bit numbers and then on the 64 bit result, I need to apply a 32 bit modulo…

At the moment, I dispose of a 560 Ti, 580 and 680…

So is there any trick or special lib that show the best performances with mul+mod or the old-school PTX is still the way to go (on Fermi and Kepler)?

Is it true that there will native 64-bit integer operations support in GK110???

Something like this:

asm ("{\n\t"

		".reg .u64 t1;\n\t"

		".reg .u64 t2;\n\t"

		"mul.wide.u32	t1, %1, %2;\n\t"  // t1 = x[i] * y[i]

		"cvt.u64.u32	t2,     %3;\n\t"

		"rem.u64		t1, t1, t2;\n\t"  // t1 = t1 mod t2

		"mov.b32		%0,   t1+0;\n\t" 

		"}"         

		: "=r"(__qB[threadIdx.x]): "r"(__q[threadIdx.x]), "r"(__Bi_inAUar[threadIdx.x*(SET_SIZE+1) + blockIdx.x]), "r"(__bases[blockIdx.x]))0;