512-bit unsigned integer

Has anyone come across a lightweight implementation of a 512-bit unsigned integer that can be used on the device?

My problem with the implementations I have found (if they work at all :'( ) is that they are often very inefficient, especially when it comes to memory. Since I’m targeting compute capability 6.1, where a thread can use at most 255 registers, I can’t afford to use too many big ints per thread.

If someone can provide some insight from their own endeavors or give me some ideas on how to proceed I would be very thankful!

Here are a few resources:

  • question on SO
  • question on these forums; I imagine there is a straightforward extension of the 256-bit ninja stuff there to 512 bits (here is the 128-bit version) — a sketch of that extension follows this list
  • xmp/cgbn
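
To give a flavor of that extension: the technique chains PTX add-with-carry instructions across the limbs. Below is a minimal, untested sketch of a 512-bit add in that style; the uint512 struct and its little-endian limb layout are my own assumptions, not code from those posts.

    // Hedged sketch: 512-bit unsigned add via the PTX carry chain,
    // in the style of the linked forum posts. Assumed layout:
    // 8 x 64-bit limbs, least significant limb first.
    struct uint512 {
        unsigned long long limb[8];
    };

    __device__ uint512 add512(const uint512 &a, const uint512 &b) {
        uint512 r;
        asm("add.cc.u64  %0, %8,  %16;\n\t"   // lowest limb sets the carry flag
            "addc.cc.u64 %1, %9,  %17;\n\t"   // middle limbs consume and propagate it
            "addc.cc.u64 %2, %10, %18;\n\t"
            "addc.cc.u64 %3, %11, %19;\n\t"
            "addc.cc.u64 %4, %12, %20;\n\t"
            "addc.cc.u64 %5, %13, %21;\n\t"
            "addc.cc.u64 %6, %14, %22;\n\t"
            "addc.u64    %7, %15, %23;"       // top limb: carry consumed, not produced
            : "=l"(r.limb[0]), "=l"(r.limb[1]), "=l"(r.limb[2]), "=l"(r.limb[3]),
              "=l"(r.limb[4]), "=l"(r.limb[5]), "=l"(r.limb[6]), "=l"(r.limb[7])
            : "l"(a.limb[0]), "l"(a.limb[1]), "l"(a.limb[2]), "l"(a.limb[3]),
              "l"(a.limb[4]), "l"(a.limb[5]), "l"(a.limb[6]), "l"(a.limb[7]),
              "l"(b.limb[0]), "l"(b.limb[1]), "l"(b.limb[2]), "l"(b.limb[3]),
              "l"(b.limb[4]), "l"(b.limb[5]), "l"(b.limb[6]), "l"(b.limb[7]));
        return r;
    }

Subtraction follows the same pattern with sub.cc/subc, and multiplication chains mad.lo.cc/madc.hi instructions; the linked posts show the 128-bit and 256-bit versions.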

If it were me, I would start with xmp/cgbn, and then look at the ninja stuff by njuffa only if needed.
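
Roughly, a CGBN-based 512-bit powmod would look like the sketch below. This is from memory of the repo’s sample code, so treat the names and signatures as assumptions and check the CGBN samples for the exact API; TPI, instance_t, and the kernel itself are placeholders of mine. Note that cgbn_modular_power requires an odd modulus, per the CGBN docs.

    #include <gmp.h>          // CGBN's host-side tests use GMP
    #include "cgbn/cgbn.h"    // https://github.com/NVlabs/CGBN

    #define TPI  8            // threads cooperating on one big integer (assumed)
    #define BITS 512

    typedef cgbn_context_t<TPI>         context_t;
    typedef cgbn_env_t<context_t, BITS> env_t;

    typedef struct {
        cgbn_mem_t<BITS> base, exp, mod, result;   // assumed instance layout
    } instance_t;

    // Each group of TPI threads handles one 512-bit powmod instance.
    __global__ void powmod_kernel(cgbn_error_report_t *report,
                                  instance_t *instances, uint32_t count) {
        int32_t instance = (blockIdx.x * blockDim.x + threadIdx.x) / TPI;
        if (instance >= count) return;

        context_t ctx(cgbn_report_monitor, report, instance);
        env_t     env(ctx.env<env_t>());
        env_t::cgbn_t b, e, m, r;

        cgbn_load(env, b, &instances[instance].base);
        cgbn_load(env, e, &instances[instance].exp);
        cgbn_load(env, m, &instances[instance].mod);
        cgbn_modular_power(env, r, b, e, m);   // modulus must be odd
        cgbn_store(env, &instances[instance].result, r);
    }

A design point relevant to your register worry: CGBN spreads each big integer across TPI threads, so the per-thread register footprint is a fraction of what a single-thread 512-bit type needs.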

I wouldn’t start out on this road by worrying about register usage. You don’t have direct control over that anyway, and assuming that registers will limit your options isn’t supportable without measurement. And I don’t know how you would assess efficiency without building test cases. Unless you are a compiler genius.


I am currently using my fork of https://github.com/blawar/biginteger/blob/master/integer.h, with some modifications to get it compiling and running on CUDA in the first place. I am targeting blocks of 1024 threads to maximize utilization. The problem is that the operations I need (powmod) end up using too many registers, at least according to nvcc.
I have looked into the SO and forum posts, and yes, they are definitely good options if I end up implementing it myself. I was hoping someone else had run into this problem before and had a handy solution.
I have looked into xmp/cgbn, but they both seem like overkill for my problem.

Thanks for taking the time to answer my question.

This is controllable. Just use the tools nvcc offers to limit per-thread register usage. Yes, it will probably have performance implications, but it is a logical starting point. I would evaluate that first if I had made that kind of investment in a code base.

How would that work? If I limit the number of registers, won’t the computation just fail in the first place? I mean, it used too many registers per thread, not too many registers in total.

You said this:

Can you be more specific? In particular, can you copy and paste the nvcc output you are referring to into this thread?

For some background, you may want to read this including the comments.

Sure! Here is the output:
Fatal error: kernel (too many resources requested for launch)
Adding the -Xptxas -v option tells me it uses 249 registers per thread.
Maybe I understood the error wrong. Maybe I’m running into the 65K max-registers-per-block limit rather than the 255-registers-per-thread limit. But in either case, using 249 registers seems super inefficient, or maybe that’s just the way it is? I assume this comes from having a bunch of temporary variables in the computation. Or are those registers released once the temporaries go out of function scope? I don’t really have a lot of experience with this kind of low-level GPU programming.

So if you add -maxrregcount 64 to your compile command line, you should be able to make the fatal error go away. The arithmetic: with 1024-thread blocks, 249 registers per thread requests 249 × 1024 ≈ 255K registers per block, far beyond the 65,536-register-per-block limit; 65536 / 1024 = 64 registers per thread is the most that block size allows. Yes, there may be a performance implication, since registers beyond the cap get spilled to local memory, but since your previous attempt was not running at all, the comparison may be moot or have to be considered carefully.
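
For concreteness, here is a sketch of both knobs; the kernel name, signature, and body are placeholders of mine, not your code.

    // Per-kernel alternative to the file-wide -maxrregcount flag:
    // __launch_bounds__(1024) tells ptxas this kernel may be launched
    // with 1024-thread blocks, so it must fit 65536 / 1024 = 64
    // registers per thread; the excess is spilled to local memory.
    //
    // File-wide equivalent on the compile line:
    //   nvcc -arch=sm_61 -maxrregcount 64 -Xptxas -v kernel.cu
    __global__ void __launch_bounds__(1024)
    powmod_kernel(unsigned long long *out)   // hypothetical signature
    {
        // ... the 512-bit powmod body would go here ...
        out[blockIdx.x * blockDim.x + threadIdx.x] = 0ULL;  // placeholder
    }

The per-kernel form is usually preferable, because it caps only the kernel that needs it instead of every kernel in the file.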

If I had invested some time in a code base and ran into that error, that is what I would do first, before throwing everything out and starting over.