I am trying to implement 1024-bit RSA encryption on CUDA. In order to do this, I need to perform wide operations such as 1024-bit multiplication. Can anyone help me? How can this be made parallel?
Also, is there an efficient way to get the modulus for a wide division like this (in parallel)? I mean something like "2048 bits" % "1024 bits".
One thing one could do to parallelize the multiplication is to accumulate columns of partial products in parallel, then propagate the carries between columns at the end. If one were to do this at the HLL level rather than in PTX, multiplying 32x32->64 bits using integer multiply plus __umulhi() and accumulating into a 96-bit per-column accumulator seems a reasonable first cut. A single multiplication is unlikely to fill the GPU unless the numbers are much larger than 1024 bits, at which point one might want to go with FFT-based multiplication. There are many different ways of multiplying (e.g. Karatsuba, Toom-Cook, FFT) and exponentiating (e.g. binary, sliding window), so some experimentation seems called for to find the switchover points between the various methods. It's also a good idea to peruse the existing literature first, as there may already be published papers about RSA on the GPU.
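As a hedged illustration of the column approach (all names and sizes below are my own, not from the post): one CUDA thread per product column, each accumulating its 32x32->64-bit partial products into a 96-bit accumulator held as three 32-bit words, followed by a carry-release pass, shown sequentially here for clarity.

```
// Sketch only: column-wise accumulation for an N-word (N=32 -> 1024-bit)
// schoolbook multiply. One thread handles one of the 2N-1 columns.
#define N 32

__global__ void mul_columns(const unsigned int *a, const unsigned int *b,
                            unsigned int *col_lo, unsigned int *col_mid,
                            unsigned int *col_hi)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    if (k >= 2 * N - 1) return;

    unsigned int lo = 0, mid = 0, hi = 0;            // 96-bit accumulator
    for (int i = max(0, k - (N - 1)); i <= min(k, N - 1); i++) {
        unsigned int p_lo = a[i] * b[k - i];             // low  32 bits
        unsigned int p_hi = __umulhi(a[i], b[k - i]);    // high 32 bits
        // fold the 64-bit partial product into lo:mid:hi
        unsigned long long s = (unsigned long long)lo + p_lo;
        lo  = (unsigned int)s;
        s   = (unsigned long long)mid + p_hi + (s >> 32);
        mid = (unsigned int)s;
        hi += (unsigned int)(s >> 32);
    }
    col_lo[k] = lo;  col_mid[k] = mid;  col_hi[k] = hi;
}

// Carry release: product word k collects lo[k], mid[k-1], and hi[k-2],
// plus the carry out of word k-1.
void propagate_carries(const unsigned int *lo, const unsigned int *mid,
                       const unsigned int *hi, unsigned int *prod)
{
    unsigned long long carry = 0;
    for (int k = 0; k < 2 * N; k++) {
        unsigned long long s = carry;
        if (k < 2 * N - 1) s += lo[k];
        if (k >= 1)        s += mid[k - 1];
        if (k >= 2)        s += hi[k - 2];
        prod[k] = (unsigned int)s;
        carry = s >> 32;
    }
}
```

With 32 words per operand, each column sums at most 32 partial products of 64 bits each, which fits comfortably in 96 bits; a production version would also parallelize the carry release rather than run it sequentially.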
Note that one typically does not perform a full modulo operation for the reduction step in RSA (cf. Montgomery and Barrett reduction). If you do need a full modulo operation, it can be mapped back to multiplication via Newton-Raphson based division and a back-multiply.
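To make the back-multiply idea concrete, here is a toy-scale sketch (the helper is hypothetical and my own; at multi-word scale the reciprocal would instead be refined by Newton-Raphson iterations x' = x * (2 - d * x), doubling the number of correct bits per step):

```
// Toy-scale sketch of "divide via reciprocal estimate, then back-multiply":
// computes n % d. At 1024-bit scale the reciprocal comes from Newton-Raphson
// iteration rather than a hardware divide, but the structure is the same.
unsigned int mod_via_backmultiply(unsigned long long n, unsigned int d)
{
    double recip = 1.0 / (double)d;               // reciprocal estimate
    unsigned long long q =
        (unsigned long long)((double)n * recip);  // quotient estimate
    long long r = (long long)(n - q * (unsigned long long)d); // back-multiply
    // The estimate can be off by a small amount; fix up the remainder.
    // A production version bounds the estimate error analytically.
    while (r < 0)             r += d;
    while (r >= (long long)d) r -= d;
    return (unsigned int)r;
}
```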
[following added later]
Here is a fairly recent paper that looks like it could be a reasonable starting point (I have not read this, I am just going by the abstract):
Owen Harrison, John Waldron. Efficient Acceleration of Asymmetric Cryptography on Graphics Hardware.
In: Proceedings of AfricaCrypt 2009, June 21-25, 2009, pp. 350-367
Abstract: […] We show how a GPU implementation of a 1024-bit RSA decrypt primitive can outperform a comparable CPU implementation by up to 4 times and also improve the performance of previous GPU implementations by decreasing latency by up to 7 times and doubling throughput.[…]
I have an algorithm to compute the multiplication of two N-digit numbers with a result of order 2N-1. A sample result of my algorithm is shown below. Contact me if you need more details about it.
If you would like to multiply large numbers: in the 1980s, Luther Welch and I searched for Mersenne primes using the Lucas-Lehmer test. This involves squaring large integers, in our case about 33,000 digits (now considered trivial). GIMPS can square a 33,000-digit number 110,502 times, taking the remainder mod 2^110503, in just 11 seconds. You need to learn about FFT multiplication, where you can multiply each complex pair completely in parallel. I don't know if little 1024-bit numbers (about 333 digits) are large enough to offset the cost of converting to and from the complex FFT domain. I think it is worth a try, however.
I have a basic question on implementing RSA using the FFT: for RSA-1024, should I do the 1024 multiplications in the frequency domain and take a single inverse Fourier transform at the end, or should I do an inverse Fourier transform immediately after the FFT for each multiplication? I know the first way may be wrong, but at the same time the overhead of doing an FFT and then an inverse FFT for each multiplication is big, and I want to reduce that overhead to a minimum so that the intermediate multiplications are done in floating point only. Please comment if I am wrong. Thanks!
It must be the latter. Multiplying two numbers is a convolution operation, and you have to transform back into the integer domain to preserve full accuracy in the product. There are alternative techniques like RNS (residue number systems) that avoid having to release carries across an entire multi-word integer, but you can’t use the FFT on them.
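To make that concrete, here is a hedged sketch of the step the FFT route cannot skip (names and the choice of base are mine): after the inverse transform, the real outputs are rounded to integer column sums and the carries are released to recover the exact digits of the product.

```
// Sketch: carry release after an inverse FFT. 'conv' holds the (real parts
// of the) inverse-transform outputs, i.e. the convolution column sums of
// two base-2^16 digit arrays; rounding plus carry propagation yields the
// exact digits of the product.
void release_carries(const double *conv, unsigned int *digits, int n)
{
    unsigned long long carry = 0;
    for (int i = 0; i < n; i++) {
        unsigned long long t =
            (unsigned long long)(conv[i] + 0.5) + carry;  // round to integer
        digits[i] = (unsigned int)(t & 0xFFFF);           // low 16 bits stay
        carry     = t >> 16;                              // rest moves up
    }
}
```

The small base (2^16 rather than 2^32) keeps the column sums well within the 53-bit double-precision mantissa, which is what preserves full accuracy in the rounding step.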
Wouldn’t it also be worthwhile to point to one or several of your relevant papers? I assume you are the same “Niall Emmart” who published on these topics a couple of years back …