Yes, that’s why I’ve written that you mul_96_192() looks good to me.
Correct. Those numbers are for my barrett79 kernel which handles FCs up to 2^79.
Main loop:
square_96_160() /* 96 bit input (must be < 2^79), 160bit output */
mul_96_192_no_low3() /* better name would be mul_96hi(), caculate upper 96 bits of a 96x96 multiplication with some loss of accuracy (no carrys from lower half) */
mul_96()
sub_96()
a single limb division as adjustment because plain barrett returns N = AB mod M where N < 3M!
There are two kernels which do classical division and mul32.[lo|hi].u32. The two kernels are for FCs up to 2^75 and 2^95. The 2^75 kernel is just a stripped down version of the 2^95 kernel. The 2^95 kernel is good for 45-50% performance compared to the barrett79 kernel on my GTX 470. So the classical division (192/96 → 96) is not that bad. I’m using a floatingpoint approximation for the quotient, 20 bits each step.
Oliver
P.S. some more PTX coding (more mad with carry) now yields 335M/s on my GTX 470.