Hi everyone. I’m wondering if there is some cuda library that implements arithmetic on integer of undefined size.

Interesting question, as far as I know, math like addition and subtraction and multiplication and division are all serial algorithms, at least as it’s teached on basic school ;)

The basic methods require “carries” and “borrows” to be transffered/worked away for example:

  •  1



A carry like that must go all the way through the “digit array” before the next operations can be done otherwise there is a theoretical risk of “carry overflow” if the carries are “accumulated”.

In practice perhaps some cheating could be applied by “summing the carries” and applieing the carries later… but this would from a math point probably be less correct, unless “carry summation overflow” is detected and prevented and handled when it’s necessary.

I am not much of a math guy, perhaps there are already “parallel addition/subtraction/multiplication and division algorithms” out there ! ;)

Multiplication has pretty much the same issue as addition, since normally it would require addition as well for the “carry digit” to be transferred.

Division is a somewhat other beast, that requires an even slower routine, and if I remember correctly probably requires multiple subtractions or something like that, and is also serial…

So at least the basic methods were probably not intended to be parallel… (then again the way kids write the carry above the others does show at least some parallelism ;)) but I can imagine the following simple algorithm which might be of some use:

  1. Perform additions and multiplications in parallel for each digit/position and such, store the transports.

  2. Perform carry transportation in parallel.

There could be two different algorithms or perhaps a combination but the later would get a bit tricky:

First version would be:

Perform step 1.

Repeat step 2 until all carries are solved = 0.

Second version would be the more risky one:

Perform step 1.

Perform step 2. (accumilate carries in integers)

Repeat step 1 and step2 at the same time, skip step 1 when done, but continue repeating step 2 until all carries are done/solved = all caries zero.

The second method is more risky if the integer cannot hold anymore carries. So 2^31 overflow or 2^32 overflow.

^ Perhaps that could be detected by checking (2^32-1)-255
^ When any carry reaches that value it’s time to switch to carry solving method where they are reduced somewhat… perhaps they don’t even need to be reduced to zero, but just away from the limit which would be good enough until the next close encounter is met.

So it seems somewhat doable, the above risky method assumes byte-based additions and multiplications, this could be further enhanced by word-based additions and multiplications, maybe even integer-based additions and multiplications if 64 bit integers would be used for carries and such…

(Currently I have no such need for such a library, but I can imagine it would be usefull for (new) encryption algorithms and perhaps hashing/integrity checking, why would it be usefull for you or others, can you give some examples of usage scenerio’s ?)

I need support for biginteger in order to write an algorithm for fast modpow execution. All this for a public-key cryptography.

In digital networks theory there are circuit designed to achieve speed in multiplication and addition, pushing the parallelism example are:
The lookahead carry unit for addition:

And Wallace tree for multiplication:

I’m new to cuda. But i think that emulating this strategy it is possible to achieve faster result than a normal cpu…

In summary there is no such library? :(

I don’t have time to implement all the core stuff of addition and multiplication.

I don’t know if such a library exists, have you tried google ? :)

Those links are pretting interesting, especially the addition one, haven’t looked into the multiplication one I might have my own idea for multiplication.

The addition one is kinda interesting and could perhaps be used to construct a “typical” cuda/parallel version.

Basically the link comes down to “a+b digit pairs” which generate carries by themselfes like 5+5 or 6+4 and “a+b digit pairs” which propagate potential carries like “4+5” and “2+7”.

I guess this leaves other “a+b digit pairs” as “carry consuming pairs” ;) (plus others see below ;))

Determining which digit pairs are “propagating” can be done in parallel then these pairs which are in sequantial order could give small little offsets with a pre-fix-sum scan or something like that.

This could then allow the “generating” carries to immediatly transfer their carry through the “propagate warp jump” ;) :) onto the final destination.

I guess “generating carrie pairs” are also “carry consuming pairs”. 9+9 = 18, one carry out, but no further propagating so that seems a safe statement.

So the parallel algorithm for addition would start with:

  1. Determining all propagates.

  2. Computing “pre-scan” / consume jumps.

  3. Determining all carry generating pairs.

  4. Transferring their carries in one jump to final destination for final addition.

So the overhead would be in the “scan”.

(I am not yet completely sure how to calculate the jumps properly with the scan, these details remain still a bit vague…)

I guess all non-jump pairs will simply be zero and mean nothing has to happen in step 4 for those pairs… though the jump could also be transferred first, and then every pair can see that carry too
and do the final thing themselfes… Further remaining question is: will each destination be jumped to just once, or would their be “conflicts”… there probably would not be any conflicts.

Thanks Skybuck, I googled but i didn’t find anything useful. :(

Can you point me some well written resource on CUDA programming, starting from a beginner perspective up to a more skilled one?

My thoughts on your question about a cuda tutorial are the following:

  1. At the present time CUDA is not for the faint harted, with that I mean it’s mostly ment for expert programmers who seek additional performance real bad and are willing to dive into the details of how the hardware works to achieve that. Having said that:

  2. All the CUDA tutorials I have seen so far are written pretty concrete and involve lot’s of C code. (I haven’t looked into higher level programming languages/libraries like brook or anything like that… (I am not interested in those because they would probably be to limiting for me).

  3. What is most important about understanding how to do cuda programming is the high level abstract concepts, like what is a kernel, what is parallel programming about, how does cuda divide the work, that sort of thing, once that is understood then one can delve into the lower level details, there are some power point presentations with some pictures trying to explain this… but those powerpoint presentations are focussed heavily on the hardware architecture, so still a bit too concrete.

  4. For now I find writing the cuda kernels to be the easy and fun part. Writing the plumbing around it is the harder part.

  5. I am currently implementing my own object oriented framework for Delphi to make the programming/plumbing a little easier, while still be quite powerfull and flexible.

  6. I could write a tutorial for noobs, but why would I do that ? What would I gain from that ? (Perhaps some fellow cuda collegees :)) But I do dream of writing some successfull software, so why should I help potential competitors ? :) Ofcourse the parallel programming world is huge but still ;) :) I would require a huge ammount of my time to write tutorials and I gain little from it.

Finally some questions to you:

  1. What exactly do you mean/understand with a “beginner” ? Does this mean a somewhat decent C programmer or perhaps an expert C programmer just looking to do cuda ? Or a total programming noob ?

Teaching how to program to noobs is something interesting but would require a lot of studieing by the student… so if you a noob C programmer I wish you lot’s of luck… I will try to help here and there… but don’t expect me to start writing a tutorial on how to program in C or C++… I think that’s somewhat beyond the scope of this… Unfortunately one has to be able to program pretty well in C/C++ to be good at cuda… none the less I have seen some noobs write cuda stuff as well… which is interesting and remarkable… I applaud their dare and bravery… it’s quite remarkable and fun too watch ;) :)

The best advice I can give you, is the following:

  1. Simply start and take it slowly, it requires many little steps, do a few steps each day.

For now the steps are:

  1. Installing the necessary software, this will require one or two days of research.
  2. Getting some basic kernels running of your own, if c/c++ is used this is pretty easy to do, requires one day or so.

So if you choose the c/c++ path then you will be up and running in 3 days or so, and then you can start to write some stuff.

  1. So simply start easy with little programming examples, start with “hello world” would be a good choice.

  2. Also read the programming guide extensively/many times, especially build-in variable sections and the extensions to C which make CUDA C.

This last part should have been a seperate document which would have been clearer, so that’s something that remains to be desired and better explained, but it’s also evolving it seems.

I will be around on these forums for a while I think… so simply start somewhere if you interested in it and if you have any questions, search google or this forum and when you really can’t figure it out come back here and ask a question ;) :)

Perhaps in a few weeks or so I would be able to write such a library for you, but then you’d have to pay me something like 1000 to 2000 euro’s for it ! ;) =D
(Also I am not yet sure if division or modula can be done in parallel, and multiplication also remains a bit vague but is probably doable.)
( I am guessing you will need division/module too… so there is some risk of “project failure”… would you need anything else like and/or/xor that kind of thing ?)

Also I was imagining a “dll” library which could be simply used from outside cuda for non-cuda programmers, but I think/guess this idea can safely be forgotton about because cuda api/pci express bus/memory copieing between cpu and gpu is very slow and would probably bottleneck the dll solution too much.

So if a library would be written it would have to be some library which can be used inside the kernels themselfes and your algorithm will probably have to be implemented as much as possible inside kernels as well to prevent pci express bus overhead/api calls and such ;) writing “in kernel” libraries is possible to but is easier to copy or something and I was hoping to then sell some more copies to others too ;) :) not sure if that would work out well ;) :)

I don’t think you would be interested in that so I will probably continue with my own projects ;) :)

There is one last possibility which you could work on:

  1. Simply write the encryption/decryption algorithm in high level/pseudo code. Simply assume 1d, 2d or 3d, or 4d arrays/volumes where the integers/elements can be and operate on them with 1d or 2d or 3d or 4d indexes (up to 6d).

So you could discuss your encryption/decryption algorithms on these forums in a high level view and then perhaps the people who understand cuda better could advice you on that… but perhaps you want to keep your encryption/decryption algorithms secret ? :)

If you do describe your encryption/decryption algorithms ideas then perhaps some more people might be interested in it and the chances of getting some help with the big integer math routines/libraries might increase… so far most encryption/decryption algorithms I have seen use and/xor/or and shl/shr and such/stuff like that so mimic some kind of high level/big integer math…

Can’t you port GMP?

Cuda Training Resources:

I’m a nuclear engineer / physics student. I taught myself cuda, C, C++, and basic fortran over the course of a summer internship, it isn’t quite as tough as Skybuck makes it out to be, but it can still be pretty complicated if you try to just jump into it.

I’d also recommend the book Cuda by Example by Jason Sanders and Edward Kandrot. A very good book that goes over all of the basics with example code.

Lol that requires some explaning from you ;)

Nobody/no beginner leans C/C++/Fortran and CUDA in one summer ! LOL. Unless you had previous programming experience ;)

I had some experience with IDL, Matlab, and a course in Python that I managed to get a D in. ;)

(I also go to MIT)

What happened is that I had an internship with General Atomics. I showed up and was talking to my adviser and I expressed an interest in CUDA. He knew nothing about it, I knew very very very little, but we determined that it might be beneficial to look into porting a monte carlo code over. The monte carlo code is a neutral beam injection simulation code used in tokamak simulations. It is about 50,000 lines of fortran, most of it fortran77. That code, Nubeam, can be found here if you are interested:

Needless to say I had my work cut out for me. I sat down, watched all the stanford lectures on CUDA, read over the CUDA programming guide, read Numerical Recipes in C, and printed out Fortran/C/C++ quick reference cards. I also managed to find some good resources on compiling fortran+cuda code so that you can call CUDA code written in C from Fortran.

I wouldn’t say that I’m super proficient with Fortran or C++, but I’m pretty competent with C and most of the CUDA features. I’m still trying to get the hang of all of the uses of Objects in C++, but most of those you can’t use for cuda code anyway. Fortran is pretty straight forward, the most difficult part of it is compiling / formatting. Although reading Fortran77 is a massive massive pain.

Ok… wandering a bit of topic here… but what in godsname are “ion species” are these some kind of plasma creatures (lol) ? Or just some kind of ion particles/micro blobs ? :)