What is wrong with the error “ran out of registers in integer”?

I recently wrote some code that triggers the error “ran out of registers in integer”, but I just don’t know how to deal with it.
Please help.

It has been a long time since I last ran into this kind of error message. As I recall, it has to do with limitations inside the compiler, and can happen when the intermediate code gets very large (e.g. due to inlining or loop unrolling).

(1) What CUDA version are you using? I would suggest using the latest released version, which is 4.2.
(2) What platform (compute capability) are you targeting? I seem to recall this problem mostly affected sm_1x targets.
(3) Can you show the exact message from the compiler, please? (Normally the compiler will print the name of the component that emits the message.)
(4) Are you calling heavy-weight math functions (e.g. pow(), tgamma()) frequently in the code?

The typical workaround would be to reduce the code size: reduce unrolling, reduce inlining of user functions with noinline [this requires sm_20 or higher], and reduce the number of heavy-weight math functions [which get inlined] in the code.
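To make these workarounds concrete, here is a minimal sketch combining them in one kernel. The kernel and helper names are made up for illustration, and __noinline__ requires compiling for sm_20 or higher:

```cuda
// Hypothetical example: shrinking generated code to avoid the register-limit error.
__device__ __noinline__ float heavy_helper(float x)  // keep as a called subroutine
{
    return powf(x, 3.5f);  // heavy-weight math function; its body gets inlined here only once
}

__global__ void kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    #pragma unroll 1  // prevent the compiler from unrolling this loop
    for (int k = 0; k < 16; k++) {
        acc += heavy_helper(data[i] + (float)k);
    }
    data[i] = acc;
}
```

With __noinline__, the body of heavy_helper (and the pow() code it pulls in) appears once in the generated code instead of sixteen times, which reduces the number of PTX instructions and virtual registers.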

Thank you very much for your advice, njuffa. I’ve tried to reduce the number of device functions called by the global kernel as much as possible; I extracted the code of the device functions and put it directly into the kernel instead of calling them. Now the code compiles and runs normally. But I still don’t quite understand a few things about what you’ve posted.

(1) What does “sm_1x targets” mean? I’ve googled it but I just don’t get it. Maybe it’s because I use Visual Studio 2010 + Parallel Nsight 2.2 under Win7, not Linux.

(2) Under what circumstances does the NVCC compiler assign variables to registers? As far as I know, in C++ programming the CPU releases the contents of registers as soon as the previous line of code is finished.

(3) I don’t know if it’s correct to think of a device function as equivalent to inlining, and of some kinds of loops (like for and while) as equivalent to unrolling?

Looking forward to your help, and thanks again.

Please note that nvcc is just a slim driver program which calls various compiler components that perform all the real work. sm_1x refers to compute capabilities 1.0, 1.1, 1.2, and 1.3. The nvcc compiler driver refers to these as sm_10, sm_11, sm_12, and sm_13 when specifying the target architecture with the -arch flag.
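For example, the target architecture is selected on the nvcc command line like this (illustrative invocations; the file names are made up):

```shell
# Compile for compute capability 1.3 (an sm_1x target)
nvcc -arch=sm_13 -o mykernel mykernel.cu

# Compile for compute capability 2.0 (a Fermi target) instead
nvcc -arch=sm_20 -o mykernel mykernel.cu
```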

If I recall correctly, the error message you saw is emitted by the Open64 frontend, whose use is restricted to sm_1x targets in recent releases of CUDA. Open64 emits portable PTX code in SSA form (see: http://en.wikipedia.org/wiki/Static_single_assignment_form) which means each new result is assigned a new virtual register. The mapping to physical registers happens later during PTX to SASS (machine code) translation. PTX uses typed registers, and there is a upper limit on the number of each type of register (integer, float, double, predicate). When code gets very large, i.e. contains many PTX instructions, it is possible to exceed the maximum number of registers provided for some particular type, leading to the kind of error message you encountered.
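To illustrate the SSA property, here is a hand-written PTX fragment of the general kind the frontend emits (simplified; actual compiler output will differ). Each result goes into a fresh typed virtual register — %r for 32-bit integer, %rd for 64-bit integer, %f for float — and no register is ever reassigned:

```
ld.param.u64    %rd1, [param_0];   // fresh 64-bit integer register
mov.u32         %r1, %tid.x;       // fresh 32-bit integer register
mul.wide.u32    %rd2, %r1, 4;      // another fresh register for the product
add.s64         %rd3, %rd1, %rd2;  // and another for the sum
ld.global.f32   %f1, [%rd3];       // fresh float register
add.f32         %f2, %f1, %f1;     // %f1 is never overwritten; result goes to %f2
st.global.f32   [%rd3], %f2;
```

Because every intermediate result consumes a new virtual register of its type, very large straight-line code (from inlining or unrolling) can exhaust the per-type register namespace in the frontend.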

The compiler backend, PTXAS, performs the allocation of physical registers and instruction scheduling, and uses live range analysis to control this process, which is what I think you are referring to when you mention “releasing registers”. When there are not enough physical registers available, the compiler will spill registers to local memory and later reload them from there. This process takes place after the stage that produces the error message you saw.
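You can observe PTXAS’s physical register allocation and any spilling by asking for verbose output (illustrative invocation; the numbers reported depend on your code and target):

```shell
nvcc -arch=sm_20 -Xptxas -v -c mykernel.cu
# ptxas then prints per-kernel resource usage along the lines of:
#   ptxas info : Used 34 registers, 8 bytes spill stores, 8 bytes spill loads, ...
```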

The compiler will in general inline small and medium-sized functions; only large functions are made into called subroutines. This is done for performance, as it avoids call overhead, but can lead to a large code size since the function body is copied to each site the function is called from. If the function is a called subroutine instead, there is only a single copy of the function body. You can force the compiler to not inline device functions with the noinline attribute, subject to architectural restrictions (some functions must be inlined on some architectures). Use of noinline will in general reduce the number of PTX instructions generated, and thus the use of virtual registers.

Loop unrolling refers both to completely unrolling a loop (typically a for loop with a trip count known at compile time) into straight-line code, and to partially unrolling a loop (typically a for or while loop), where the loop remains a loop but the new loop body contains multiple instances of the original loop body and the loop control is adjusted. Loop unrolling is performed by replicating the loop body as many times as needed, which can lead to large code if the unroll factor is high. The compiler uses heuristics to determine which loops should be unrolled and will not unroll loops that produce too much code. Programmers can also request unrolling via #pragma unroll. To prevent unrolling, use #pragma unroll 1.
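As a concrete illustration, here is a hand-written CUDA C sketch of what partial unrolling by a factor of 4 effectively does (this is an equivalent source-level rewrite, not actual compiler output; for simplicity it assumes n is a multiple of 4):

```cuda
// Original loop:
for (int k = 0; k < n; k++) {
    sum += a[k];
}

// Partially unrolled by a factor of 4: the loop remains a loop,
// but each iteration now contains four copies of the original body,
// and the loop control steps by 4 instead of 1.
for (int k = 0; k < n; k += 4) {
    sum += a[k];
    sum += a[k + 1];
    sum += a[k + 2];
    sum += a[k + 3];
}
```

If n were a small compile-time constant, the compiler might instead unroll the loop completely into straight-line code with no loop control at all, at the cost of correspondingly more instructions and virtual registers.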

That’s really helpful, and thank you.