Any luck with GMP and CUDA? Anyone compiled GnuMP code in CUDA?

Hi all. I think I’m buying an 8800GT to add to my number-crunching box, especially if someone else has had some luck compiling GMP (GNU Multiple Precision) functions with CUDA.

If not, the GMP functions can be handled quite easily in assembly (as easy as assembly gets at least, haha), since the main data type is essentially a pointer to a chunk of machine integers (limbs) that together represent a very large integer.
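To make that layout concrete, here is a minimal C sketch, assuming 32-bit limbs stored least significant first. `limbs_add` is a hypothetical helper written for this illustration, not a GMP function; it does the ripple-carry addition that hand-written assembly would typically do with an add-with-carry instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of GMP): r = a + b over n 32-bit limbs,
   least significant limb first. Returns the final carry (0 or 1). */
uint32_t limbs_add(uint32_t *r, const uint32_t *a, const uint32_t *b, size_t n)
{
    assert(r != NULL && a != NULL && b != NULL);

    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Widen to 64 bits so the carry out of this limb is not lost. */
        uint64_t sum = (uint64_t)a[i] + b[i] + carry;
        r[i] = (uint32_t)sum;   /* low 32 bits are the result limb */
        carry = sum >> 32;      /* high bit is the carry into the next limb */
    }
    return (uint32_t)carry;
}
```

Adding {0xFFFFFFFF, 0xFFFFFFFF} and {1, 0} with n = 2 leaves both result limbs at zero and returns a carry of 1.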

It’s all integer calculations, so I realize it’s not going to take full advantage of the GPU, but right now I’m looking at runtimes of 14+ weeks (on my Merom-based MacBook Pro); any help is welcome. I’m hoping to get the total runtime of the program down to 1 week.

If anyone has any experience in getting GMP code (especially FFT multiplication or a modulo function) to run in CUDA, I think it could speed up calculations quite a bit.

Thanks in advance,
Kale //first post!!

You have a truly amazing amount of patience. I get bored waiting for 3-day long jobs to run.

I’m not aware of any GMP-type calculations performed in CUDA. It’s not as simple as just recompiling for the GPU, as you need to rewrite the code into data-parallel GPU kernels.

I realize it would require a new algorithm (since it’s a different style of processor), but I was hoping someone had written a simple function or two (multiply via FFT or add-multiply, or even a modular power function). Vector-style processing could greatly speed up large-number operations such as addition. I just really don’t want to translate the GMP data types into vectors, run the calculation, then translate them back if someone else has already written that. Oh well, time to start digging through the CUDA documentation.
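For what it’s worth, the catch with a vectorized add is the carry chain. A rough sketch of one way to split it, written in plain C rather than CUDA and assuming 32-bit limbs stored least significant first: phase 1 touches each limb independently (so each loop iteration could in principle be one GPU thread), and phase 2 resolves the carries. The names `add_phase1` and `add_phase2` are made up for this illustration, and phase 2 is sequential here; a real GPU version would presumably replace it with a parallel prefix scan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Phase 1 (hypothetical name): per-limb work with no dependence between limbs.
   gen[i]  = 1 if a[i] + b[i] overflows by itself (generates a carry).
   prop[i] = 1 if the limb sum is all ones (would pass an incoming carry on). */
void add_phase1(uint32_t *sum, uint8_t *gen, uint8_t *prop,
                const uint32_t *a, const uint32_t *b, size_t n)
{
    assert(sum && gen && prop && a && b);
    for (size_t i = 0; i < n; i++) {
        uint32_t s = a[i] + b[i];        /* wraps modulo 2^32 */
        sum[i]  = s;
        gen[i]  = (uint8_t)(s < a[i]);   /* wrapped around, so a carry came out */
        prop[i] = (uint8_t)(s == UINT32_MAX);
    }
}

/* Phase 2 (hypothetical name): apply the carries. Sequential in this sketch. */
uint32_t add_phase2(uint32_t *sum, const uint8_t *gen, const uint8_t *prop,
                    size_t n)
{
    assert(sum && gen && prop);
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t carry_out = gen[i] | (prop[i] & carry);
        sum[i] += carry;                 /* an all-ones limb wraps to 0 here */
        carry = carry_out;
    }
    return carry;
}
```

The point of the split is that only the small generate/propagate flags need cross-limb communication; the bulk of the arithmetic stays independent per limb.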

And I haven’t let a simulation completely run yet on the laptop, I’m still just prototyping.

Thanks!

You can always keep an eye on http://www.nvidia.com/object/cuda_showcase.html#papers. Quite a few papers have been posted there recently, but I don’t see any on GMP-type calculations yet.

I wouldn’t get too excited about the GMP libraries, especially the integer functions. They are optimized for single-threaded operation, and integer arithmetic is not cheap on the GPU. The following is taken from Section 5.1.1.1 of the CUDA Programming Reference Guide 2.0.


4 clock cycles for:

  • integer add
  • bitwise operations, compare, min, max, type conversion instructions

32-bit integer multiplication takes 16 clock cycles.

Integer division and modulo operation are particularly costly and should be avoided if possible or replaced with bitwise operations whenever possible:
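One common form of that replacement, sketched in C under the assumption that the values are unsigned and the divisor is a power of two: the remainder becomes a mask and the quotient becomes a shift. The helper names below are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: true when n is a nonzero power of two. */
static int is_pow2(uint32_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* i % n for a power-of-two n, using a mask instead of the modulo operator. */
uint32_t mod_pow2(uint32_t i, uint32_t n)
{
    assert(is_pow2(n));
    return i & (n - 1);
}

/* i / 2^log2n, using a right shift instead of the division operator. */
uint32_t div_pow2(uint32_t i, unsigned log2n)
{
    assert(log2n < 32);
    return i >> log2n;
}
```

This only holds for unsigned operands and power-of-two divisors; a general big-number modulo still needs a different approach.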

I would say design your own routines to take advantage of this new technology.

Good luck.

I’ve been looking into some large-integer multiplication routines for some cryptography research. I bet that some of the faster algorithms (Karatsuba, Toom-Cook) would be great on CUDA if they were coded in optimized C or in PTX. These algorithms generally work by ‘divide-and-conquer’, which is just what CUDA excels at. As a bonus, I believe that apart from the smaller sub-multiplications, they can be implemented with simple shifts and additions, so it should be possible to write some ‘calculation-intensive’ code here…
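As a toy illustration of the Karatsuba idea, here is a single level of it in C, multiplying two 32-bit values by splitting each into 16-bit halves. The function name is made up for this sketch. It uses three multiplications where the schoolbook method needs four, and everything else is additions, subtractions, and shifts; a real big-number version would apply the same split recursively to limb arrays.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical one-level Karatsuba: a * b using three half-size multiplies.
   With a = a1*2^16 + a0 and b = b1*2^16 + b0:
   a*b = z2*2^32 + z1*2^16 + z0, where z1 = (a0+a1)*(b0+b1) - z0 - z2. */
uint64_t karatsuba32(uint32_t a, uint32_t b)
{
    uint64_t a0 = a & 0xFFFFu, a1 = a >> 16;
    uint64_t b0 = b & 0xFFFFu, b1 = b >> 16;

    uint64_t z0 = a0 * b0;                           /* low halves  */
    uint64_t z2 = a1 * b1;                           /* high halves */
    uint64_t z1 = (a0 + a1) * (b0 + b1) - z0 - z2;   /* = a0*b1 + a1*b0 */

    return (z2 << 32) + (z1 << 16) + z0;
}
```

The three products are independent of each other, which is the part that looks attractive for a data-parallel machine.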

I guess I’ll add this to my list of CUDA code I want to write (which keeps getting longer and longer :lol:). If anyone else is trying to learn CUDA, this might be a neat project to try. Or, if nVidia ever plans to run another CUDA programming contest, perhaps they could run one to create an arbitrary-precision numerical library (large integer/float/decimal multiplication, addition, etc.).