32-bit number multiplication

TheJudger · November 15, 2011, 10:02am

Yes, that’s why I’ve written that you mul_96_192() looks good to me.

Correct. Those numbers are for my barrett79 kernel which handles FCs up to 2^79.

Main loop:

square_96_160() /* 96 bit input (must be < 2^79), 160bit output */
mul_96_192_no_low3() /* better name would be mul_96hi(), caculate upper 96 bits of a 96x96 multiplication with some loss of accuracy (no carrys from lower half) */
mul_96()
sub_96()
a single limb division as adjustment because plain barrett returns N = AB mod M where N < 3M!

There are two kernels which do classical division and mul32.[lo|hi].u32. The two kernels are for FCs up to 2^75 and 2^95. The 2^75 kernel is just a stripped down version of the 2^95 kernel. The 2^95 kernel is good for 45-50% performance compared to the barrett79 kernel on my GTX 470. So the classical division (192/96 → 96) is not that bad. I’m using a floatingpoint approximation for the quotient, 20 bits each step.

Oliver

P.S. some more PTX coding (more mad with carry) now yields 335M/s on my GTX 470.

elect · June 21, 2012, 7:03am

Hi guys,

I update this thread because I am going into a new Cuda project, a modified version of the RNS Montgomery Exponentiation

I am going to execute A LOT operations like: multiplication between 32 bit numbers and then on the 64 bit result, I need to apply a 32 bit modulo…

At the moment, I dispose of a 560 Ti, 580 and 680…

So is there any trick or special lib that show the best performances with mul+mod or the old-school PTX is still the way to go (on Fermi and Kepler)?

ymc · June 21, 2012, 7:58am

Is it true that there will native 64-bit integer operations support in GK110???

elect · July 1, 2012, 11:33am

Something like this:

asm ("{\n\t"

		".reg .u64 t1;\n\t"

		".reg .u64 t2;\n\t"

		"mul.wide.u32	t1, %1, %2;\n\t"  // t1 = x[i] * y[i]

		"cvt.u64.u32	t2,     %3;\n\t"

		"rem.u64		t1, t1, t2;\n\t"  // t1 = t1 mod t2

		"mov.b32		%0,   t1+0;\n\t" 

		"}"         

		: "=r"(__qB[threadIdx.x]): "r"(__q[threadIdx.x]), "r"(__Bi_inAUar[threadIdx.x*(SET_SIZE+1) + blockIdx.x]), "r"(__bases[blockIdx.x]))0;

Topic		Replies	Views
long-integer multiplication: mul.wide.u64 and mul.wide.u128 CUDA Programming and Performance	31	8293	January 2, 2018
Multiplication & remainders for large numbers. CUDA Programming and Performance	9	2647	August 4, 2016
Cuda 3.5 Integer Multiply Performance Is it really 3x slower than 64-bit floating point? CUDA Programming and Performance	21	20308	March 12, 2014
Inst_fp_32 and inst_fp_64 metrics CUDA Programming and Performance	9	1730	April 7, 2018
Modular exponentiation & BigInteger CUDA Programming and Performance	12	8512	September 28, 2017
Faster division on 64-bit unsigned integers CUDA Programming and Performance	4	2262	July 15, 2017
Measurements of different CUDA operator throughputs CUDA Programming and Performance	32	50382	August 24, 2009
32-bit multiplication and 64-bit registers CUDA Programming and Performance	6	6218	December 10, 2008
error in modulo operation CUDA Programming and Performance	12	16258	September 20, 2009
Speed comparison of division compared to other arithmetic operations, perhaps something like clock cycles CUDA Programming and Performance	9	6727	November 19, 2024

32-bit number multiplication

Related topics