Multi-precision integer arithmetic on CUDA

Hi all,

I’m working at a research institute where we make extensive use of arbitrary-precision integer arithmetic (i.e., the GMP library).

I wonder if there have been any efforts to implement something similar on CUDA?

Basically, I have started developing my own multi-precision library, but since I’m something of a newbie at
CUDA programming, it would be nice to join efforts with somebody more experienced in this subject.

Here are my thoughts (maybe wrong):
for carry propagation one could use a modified scan primitive from CUDPP (see the sketch below),
for multiplication it should be possible to implement Karatsuba or 3-way Toom-Cook in order to distribute the computation over several thread blocks.
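To illustrate the scan idea, here is a rough sketch of carry propagation via a carry-lookahead scan (my own illustration, untested; the limb count and single-block launch are just assumptions):

```
// Rough sketch of scan-based carry propagation: one thread per 32-bit limb,
// one thread block, launched as add_scan<<<1, NLIMBS>>>(a, b, r).
// Each limb sum is classified as GENERATE (wrapped around), PROPAGATE
// (== 0xFFFFFFFF) or KILL; an inclusive scan with the carry-lookahead
// operator then yields the incoming carry for every limb at once.
#include <cstdint>

#define NLIMBS 64                       // 2048-bit numbers, purely illustrative

enum { KILL = 0, GENERATE = 1, PROPAGATE = 2 };

__device__ int combine(int lo, int hi)  // carry-lookahead operator (associative)
{
    return (hi == PROPAGATE) ? lo : hi; // a propagating upper segment forwards the lower carry
}

__global__ void add_scan(const uint32_t *a, const uint32_t *b, uint32_t *r)
{
    __shared__ int flag[NLIMBS];
    int i = threadIdx.x;

    uint32_t s = a[i] + b[i];           // limb-wise sum, carries ignored for now
    flag[i] = (s < a[i]) ? GENERATE
            : (s == 0xFFFFFFFFu) ? PROPAGATE
            : KILL;
    __syncthreads();

    // Inclusive Hillis-Steele scan over the carry flags (PROPAGATE is the identity).
    for (int d = 1; d < NLIMBS; d <<= 1) {
        int prev = (i >= d) ? flag[i - d] : PROPAGATE;
        __syncthreads();
        flag[i] = combine(prev, flag[i]);
        __syncthreads();
    }

    // Carry into limb i is the carry out of limbs 0..i-1 (overall carry-in is 0).
    uint32_t carry_in = (i > 0 && flag[i - 1] == GENERATE) ? 1u : 0u;
    r[i] = s + carry_in;                // carry out of the top limb is dropped here
}
```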

Does anyone have any ideas on this?

thanks

Resurrecting a zombie thread…

I am currently interested in doing this for fixed-point arithmetic, both in CUDA and for the Intel AVX instruction set. I would use shared memory to store large numbers (on the order of 256 to 8192 bits per number, maybe). My primary goals are implementing addition and multiplication so I can run deep zooms into the Mandelbrot set.

Actually, there are two ways to parallelize big-number arithmetic:

a) run a serial algorithm on each individual thread, but many threads in parallel

b) use the threads within a thread block to perform parallel addition or the Karatsuba multiplication algorithm (divide and conquer).

Implementing a) would definitely be easier, but it may introduce some branch divergence in the carry propagation.
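For illustration, a minimal sketch of approach a) might look like this (the limb count, memory layout and names are my own assumptions):

```
// Approach a): each thread serially adds one pair of NLIMBS-limb numbers.
// Using a 64-bit intermediate keeps this particular carry loop branch-free;
// the divergence concern applies to branchy carry handling (e.g. conditional
// fix-up passes), not to this simple form.
#include <cstdint>

#define NLIMBS 32                       // 1024-bit numbers, illustrative

__global__ void add_many(const uint32_t *a, const uint32_t *b, uint32_t *r,
                         int count)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= count) return;

    const uint32_t *x = a + idx * NLIMBS;   // this thread's own operands
    const uint32_t *y = b + idx * NLIMBS;
    uint32_t *out     = r + idx * NLIMBS;

    uint32_t carry = 0;
    for (int i = 0; i < NLIMBS; ++i) {      // serial ripple-carry addition
        uint64_t s = (uint64_t)x[i] + y[i] + carry;
        out[i] = (uint32_t)s;
        carry  = (uint32_t)(s >> 32);
    }
    // the final carry is dropped here; a real implementation would return it
}
```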

Implementing b) is hard because parallelizing carry propagation is tricky, as you noted.

Christian

Found some literature describing the parallel scan algorithm and (among other things) its application in carry bit propagation:

http://pages.cs.wisc.edu/~tvrdik/21/ps/Section21.ps

oh wow, this is course material from Spring 1999. This is way before CUDA.

PTX has add-with-carry instructions (add.cc / addc), so you can use inline assembly for multi-precision integer operations.
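For example, a minimal sketch for a 4-limb (128-bit) addition could look like this (the function name and limb layout are my own):

```
// 128-bit addition using the PTX carry-chain instructions via inline assembly.
// add.cc sets the carry flag, addc.cc consumes and sets it, addc only consumes.
__device__ void add128(const unsigned int a[4], const unsigned int b[4],
                       unsigned int r[4])
{
    asm("add.cc.u32  %0, %4,  %8;\n\t"
        "addc.cc.u32 %1, %5,  %9;\n\t"
        "addc.cc.u32 %2, %6, %10;\n\t"
        "addc.u32    %3, %7, %11;\n\t"
        : "=r"(r[0]), "=r"(r[1]), "=r"(r[2]), "=r"(r[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]), "r"(b[2]), "r"(b[3]));
}
```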

True, if one uses 2^32 as the base. I am currently considering 1e9. Hmm… still undecided on this.

Making some progress here.

My (currently CPU-based) implementation uses legs (or “bigits”) of base 2^16 stored in 32-bit words. I intend to use the mul24/mul24hi instructions in CUDA (most of my hardware is Compute 1.x), which gives me 8 usable extra bits per leg that can accumulate multiple carry bits and hold an individual sign bit. I found that I can multiply two signed 1024-bit numbers correctly and only need to propagate the carries in the final result (never during the computation). I also need to make sure the legs come out either all positive or all negative in the end.

I can run this in base 10000 or base 2^16, but of course base 2^16 is more efficient to implement (the integer modulus by 10000 is expensive, and it is required after each multiplication). The next step is to map the recursive multiplication steps onto the parallel CUDA architecture. I intend to use the CPU implementation to create some kind of dependency graph, and from that generate PTX or CUDA code that performs the entire multiplication.

The idea is that I can keep several 1024-bit numbers in shared memory, for example to implement a Mandelbrot set iteration that almost never touches global memory (only across kernel calls, and for storing computation results). Each thread block works on one pixel of the set. The biggest challenge will be avoiding bank conflicts in shared memory.

Christian

Do you need the extra bits to hold multiple carries? I’d think the natural order of operations in performing the multiplication is: multiply all “legs”, sum the results, normalize. In that order, multiplication would always operate on normalized numbers, and you could use the extra bits for more precision (or increased performance). You would still have 8 more bits that can hold multiple carries from the adds.

For my application (Mandelbrot) I can use fixed-point arithmetic - I do not need a normalization step. For example, the first two legs are declared to be the integral part of the number, and the remaining 62 legs of the 1024-bit number become the fractional part. I would need to shift both operands by one leg before performing the multiplication; the result then comes out correctly scaled. The multiplication result has twice the number of legs of the individual operands, but I can simply use the most significant half (and not even compute the other half - saving some computational work).

Sorry for being imprecise. With normalization I meant the step of ensuring that each leg is in the supposed range and does not have any carries on it anymore.

Normalization (or let’s rather call it carry propagation) is rather costly, so I delay it, which improves performance. The danger is that this could cause intermediate overflows that exceed the extra bits in the legs. I tried multiplying large numbers with all bits set and the result came out correctly - of course, that is no formal proof of correctness yet.
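To be concrete, the deferred carry propagation I mean looks roughly like this (a simplified sketch using signed 64-bit legs, not the actual mul24-based representation):

```
// Final carry propagation (normalization): bring every base-2^16 leg back
// into [0, 2^16) once all additions/multiplications are done.
#include <cstdint>

void propagate_carries(int64_t *legs, int nlegs)
{
    int64_t carry = 0;
    for (int i = 0; i < nlegs; ++i) {   // least significant leg first
        int64_t v = legs[i] + carry;
        carry   = v >> 16;              // floor division by 2^16 (also handles negative legs)
        legs[i] = v & 0xFFFF;           // canonical leg in [0, 2^16)
    }
    // a nonzero carry left over here means the result overflowed its legs
}
```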

I’ve successfully unrolled the upper half of the recursive Karatsuba multiplication algorithm six levels deep and mapped it to thread-parallel CUDA code (see the attached hand-drawn flow graph of the algorithm).

I see some similarities with algorithms such as an FFT - but with the added nastiness that each recursion level produces 3 products of half size, so the amount of data grows by a factor of 3/2 with each level. After unrolling the recursion tree I obtain 3^6 = 729 individual products from the 2 x 64 legs (limbs?) of the multiplication factors A and B. To compute the product I need 2 * 729 * sizeof(int) = 5832 bytes of scratch space, which fits nicely into shared memory.

The resulting CUDA code is surprisingly simple and uses 3^5 = 243 threads per block. I use linear memory to store the permutation patterns that arise from the recursion (bound to a texture for cached access). This fits easily into the 8 KB texture cache.

What remains to be done is to condense these 729 individual products into the final result. This means going back up the recursion tree, condensing the data by a factor of 2/3 at each level (according to the summation rules shown in the lower part of my sketch).
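For reference, the decomposition and the summation (recombination) rule correspond to something like this plain recursive CPU version (a readability-first sketch with deferred carries in 64-bit legs, nothing like the unrolled CUDA code):

```
// Recursive Karatsuba on base-2^16 legs held in signed 64-bit accumulators,
// so carries can be deferred until a final normalization pass.
// r receives 2*n legs; n must be a power of two.
#include <cstdint>
#include <vector>

void karatsuba(const int64_t *a, const int64_t *b, int64_t *r, int n)
{
    if (n == 1) {                       // leaf of the recursion tree
        r[0] = a[0] * b[0];
        r[1] = 0;
        return;
    }
    const int h = n / 2;
    std::vector<int64_t> p0(n), p1(n), p2(n), sa(h), sb(h);

    karatsuba(a,     b,     p0.data(), h);           // low  * low
    karatsuba(a + h, b + h, p2.data(), h);           // high * high
    for (int i = 0; i < h; ++i) {                    // leg-wise sums for the middle product
        sa[i] = a[i] + a[i + h];
        sb[i] = b[i] + b[i + h];
    }
    karatsuba(sa.data(), sb.data(), p1.data(), h);   // (low+high) * (low+high)

    // Recombination: r = p2*B^n + (p1 - p0 - p2)*B^h + p0, with B = 2^16 per leg.
    for (int i = 0; i < 2 * n; ++i) r[i] = 0;
    for (int i = 0; i < n; ++i) {
        r[i]     += p0[i];
        r[i + h] += p1[i] - p0[i] - p2[i];
        r[i + n] += p2[i];
    }
}
// e.g. int64_t a[64], b[64], r[128]; karatsuba(a, b, r, 64); then propagate the carries.
```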

This will hopefully be one of the fastest implementations of bigint multiplication (for 1024-bit numbers) out there. I’d love to benchmark this against other implementations. I will post code when done.

UPDATE: I have finished unrolling the lower part of the recursion. First benchmarks show that I can do about 60,000 1024-bit products per second per SM on my notebook GPU. On a GTX 260 with its 27 multiprocessors I should therefore reach more than 1.5 million products per second. Cool, that is MegaMuls performance for big integers.

As a final step I need to implement carry propagation (normalization). And maybe check how many bank conflicts I have in my shared memory access patterns.

Hmm, okay, now I get about 4.6 million bignum multiplications per second on an NVIDIA GTX 260 (1024-bit operands with a 2048-bit result). A good dual-core CPU would be able to do the same using an optimized implementation such as the MPIR library. Sigh.

Any possibility you’d share your code?

After more than 10 years, I don’t have the code anymore. And I believe the code was not really competitive with any of the following libraries - and it also did not scale to different sizes of numbers.

For high performance bignum operations in CUDA, look at XMP
http://nvlabs.github.io/xmp/

and XMP 2.0 (aka CGBN)

I’ve been using both libraries for use cases such as primality checks (Fermat tests, etc.).

I’ve since tried to use Tensor Cores for bignum multiplications and succeeded somewhat - but again the speed did not exceed the benchmark set by said libraries.

Thank you very much for your quick response. CGBN is limited to “smaller” numbers than the ones I need it for; however, xmp1 seems like it could be an option. I tried to build it, but it didn’t compile. I’m completely new to CUDA, so I didn’t dive too deeply into it, as I wasn’t really sure what the constraints and design choices were, whether they were still valid, and whether it could be fixed to work with the current CUDA release. I may put some effort into it, especially since you have said its performance metrics are pretty hard to beat.

If I remember correctly, XMP lacks logic to support Pascal, Volta, Turing, and Ampere generation devices because it does not check for device capabilities beyond a certain level.

There are a few places where it checks handle->arch or __CUDA_ARCH__ where you may need to add support for your required architectures. If you use this for primality checks and cryptography (modular exponentiation), it would make sense to run tune.cu for these newer architectures as well, provided that you also compile the device code for these devices.

The choice of XMAD vs. IMAD is also critical and depends on the architecture; it affects multiplication performance a lot.
