512 Bit unsigned integer

Leteraxx · May 4, 2021, 7:56am

Has anyone come across a lightweight implementation of a 512-bit unsigned integer that can be used on the device?

My problem with the implementations I have found (if they work at all :'( )is that they will often be very inefficient, especially when it comes to memory. Since using compute 6.1 I only have 255 registers per thread I can’t afford to use too many big ints per thread.

If someone can provide some insight from their own endeavors or give me some ideas on how to proceed I would be very thankful!

Robert_Crovella · May 5, 2021, 4:02am

Here are a few resources:

question on SO
question on these forums, I imagine there is a straightforward extension of the 256-bit ninja stuff there to 512 bits (here is 128 bit version)
xmp/cgbn

If it were me, I would start with xmp/cgbn, and then look at the ninja stuff by njuffa only if needed.

I wouldn’t start out on this road by worrying about register usage. You don’t have direct control over that anyway. Assuming that registers limit your options is not supportable. And I don’t know how you assess efficiency without building test cases. Unless you are a compiler genius.

Leteraxx · May 5, 2021, 7:56am

I am currently using my fork of https://github.com/blawar/biginteger/blob/master/integer.h, with some modifications to get it running on cuda/compile in the first place. I am targeting workgroups of 1024 Threads to maximize utilization. The problem is that the operations I need (powmod) end up using too many registers. (at least that’s what nvcc tells me)
I have looked into the SO and forum post, and yes they are definitely good options if I end up implementing it myself. I was hoping someone else had come across this problem before and had a handy solution.
I have looked into xmp/cgbn, but they both seem kind of overkill for my problem.

Thanks for taking your time to answer my question.

Robert_Crovella · May 5, 2021, 1:45pm

This is controllable. Just use the tools nvcc offers to limit registers per thread usage. Yes, it will probably have perf implications, but that is a logical starting point. I would evaluate that first if I had made that kind of investment in a code base.

Leteraxx · May 6, 2021, 6:38am

How would that work? if I limit the number of registers, won’t the computation just fail in the first place? I mean it used too many registers per thread, not too many registers in total.

Robert_Crovella · May 6, 2021, 1:08pm

You said this:

Can you be more specific? Specifically, can you copy and paste the nvcc output you are referring to into this thread?

For some background, you may want to read this including the comments.

Leteraxx · May 6, 2021, 4:23pm

Sure! Here is the output:
Fatal error: kernel (too many resources requested for launch)
Adding the -Xptxas -v option tells me it uses 249 registers per thread.
Maybe I understood the error wrong. Maybe I’m running into the 65k max registers limit not the 255 registers per thread. But in either case, using 249 registers seems super inefficient, but maybe that’s just the way it is? I assume this comes from stuff like having a bunch of temp variables in the computation? or are those released once they go out of the function scope? I don’t really have a lot of experience with this low level gpu programming.

Robert_Crovella · May 6, 2021, 4:40pm

So if you add -maxrregcount 64 to your compile command line, you should be able to make the fatal error go away. Yes, there may be a performance implication, but since your previous attempt was not running at all the comparison may be moot or have to be considered carefully.

If I had invested some time in a code base and ran into that error, that is what I would do first, before throwing everything out and starting over.

Topic		Replies	Views
Register usage of a device function for vector rotation CUDA Programming and Performance	14	704	June 12, 2022
How many registers could one thread use at most? When would the data be spit out to local memory? CUDA Programming and Performance	22	8908	July 1, 2010
Anyone help me with the link to the register usage break points? CUDA Programming and Performance	4	329	July 14, 2022
reducing the number of used registers CUDA Programming and Performance	8	6307	September 22, 2009
NVCC chooses to use local memory while there is a lot of registers it can use CUDA Programming and Performance	10	1461	January 7, 2022
Nvfortran 、 NVIDIA RTX A4000、registers nvc, nvc++ and nvfortran	18	1082	August 1, 2023
Ways to reduce number of registers CUDA Programming and Performance	14	19916	January 16, 2008
CUDA FORTRAN/OpenACC "Overflow" Register with maxr Legacy PGI Compilers	5	6156	February 5, 2014
109 registers per thread,should I be worried? CUDA Programming and Performance	1	597	July 20, 2014
How to determine number of register per thread How to determine number of register per thread from a CUDA Programming and Performance	4	13418	May 13, 2021

512 Bit unsigned integer

Related topics