512 Bit unsigned integer

Has anyone come across a lightweight implementation of a 512-bit unsigned integer that can be used on the device?

My problem with the implementations I have found (if they work at all :’( )is that they will often be very inefficient, especially when it comes to memory. Since using compute 6.1 I only have 255 registers per thread I can’t afford to use too many big ints per thread.

If someone can provide some insight from their own endeavors or give me some ideas on how to proceed I would be very thankful!

Here are a few resources:

  • question on SO
  • question on these forums, I imagine there is a straightforward extension of the 256-bit ninja stuff there to 512 bits (here is 128 bit version)
  • xmp/cgbn

If it were me, I would start with xmp/cgbn, and then look at the ninja stuff by njuffa only if needed.

I wouldn’t start out on this road by worrying about register usage. You don’t have direct control over that anyway. Assuming that registers limit your options is not supportable. And I don’t know how you assess efficiency without building test cases. Unless you are a compiler genius.

I am currently using my fork of https://github.com/blawar/biginteger/blob/master/integer.h, with some modifications to get it running on cuda/compile in the first place. I am targeting workgroups of 1024 Threads to maximize utilization. The problem is that the operations I need (powmod) end up using too many registers. (at least that’s what nvcc tells me)
I have looked into the SO and forum post, and yes they are definitely good options if I end up implementing it myself. I was hoping someone else had come across this problem before and had a handy solution.
I have looked into xmp/cgbn, but they both seem kind of overkill for my problem.

Thanks for taking your time to answer my question.

This is controllable. Just use the tools nvcc offers to limit registers per thread usage. Yes, it will probably have perf implications, but that is a logical starting point. I would evaluate that first if I had made that kind of investment in a code base.

How would that work? if I limit the number of registers, won’t the computation just fail in the first place? I mean it used too many registers per thread, not too many registers in total.

You said this:

Can you be more specific? Specifically, can you copy and paste the nvcc output you are referring to into this thread?

For some background, you may want to read this including the comments.