I’ve successfully unrolled the upper half of the recursive Karatsuba multiplication algorithm six levels deep and mapped it to thread-parallel CUDA code (see the attached hand-drawn flow graph of the algorithm).

I see some similarities with algorithms such as the FFT — but with the added nastiness that each recursion level produces 3 products of half size, so the amount of data grows by a factor of 3/2 per level. After unrolling the recursion tree I obtain 3^6 = 729 individual products from the 2 x 64 limbs of the multiplication factors A and B. To hold these products I need 2*729*sizeof(int) = 5832 bytes of scratch space, which fits nicely into shared memory.

The resulting CUDA code is surprisingly simple and uses 3^5 = 243 threads per block. I store the permutation patterns that arise from the recursion in linear memory and bind that to a texture to get cached access; the tables fit easily into the 8 KB texture cache.

What remains to be done is to condense these 729 individual products into the final result. This means walking back up the recursion tree, shrinking the data by a factor of 2/3 at each level (according to the summation rules shown in the lower part of my sketch).

This will hopefully be one of the fastest implementations of bigint multiplication (for 1024-bit numbers) out there. I’d love to benchmark it against other implementations. I will post the code when it’s done.

UPDATE: I have finished unrolling the lower part of the recursion. First benchmarks show that I can do about 60,000 1024-bit products per second per SM on my notebook GPU. On a GTX 260 with its 27 multiprocessors I should therefore reach more than 1.5 million products per second. Cool, that is MegaMuls performance for big integers.

As a final step I need to implement carry propagation (normalization). And maybe check how many bank conflicts my shared-memory access patterns cause.