Compiler error: ran out of registers. Any suggestions?

Hi all,

I’ve been working on a very large project, and once the amount of code within a __global__ function (including several inlined __device__ functions) gets too large, I get the following error:

Assertion failure at line 2320 of …/…/be/cg/NVISA/cgtarget.cxx:

Compiler Error in file /tmp/tmp_00006868-1.i during Register Allocation phase:

ran out of registers: 6

*** glibc detected *** /usr/local/cuda/open64/lib//be: free(): invalid pointer: 0x08a140b0 ***
Signal: Aborted in Register Allocation phase.
Error: Signal Aborted in phase Register Allocation – processing aborted
*** Internal stack backtrace:
/usr/local/cuda/open64/lib//be [0x8324736]
/usr/local/cuda/open64/lib//be [0x83247aa]
/usr/local/cuda/open64/lib//be [0x8325431]
/usr/local/cuda/open64/lib//be [0x8325500]
/usr/local/cuda/open64/lib//be [0x8325578]
/usr/local/cuda/open64/lib//be [0x8323e53]
/lib/ [0x401b19fe]
nvopencc INTERNAL ERROR: /usr/local/cuda/open64/lib//be died due to signal 4
make: *** [all] Error 255

I have tried writing several different __global__ functions, and every time the code gets too large I encounter this error again. I have tried to pin the error down to something specific in my code, but so far I haven’t been able to come up with anything.

For example, suppose I have defined the following function elsewhere:

/* A very large function calling other device functions */
__device__ void mydevfunc(void);

The following code will compile and run just fine:

__global__ void myglobalfunc(void){



However, I get the error above just by adding one more call to mydevfunc(). For example, the following code does not compile:

__global__ void myglobalfunc(void){



I have been trying to run the code as a single thread as follows:

    dim3 grid(1,1);
    dim3 threads(1,1,1);

(In the actual code, though, I have several parameters to the __global__ and __device__ functions, and return values from the __device__ functions.)

My questions:

Does anyone have any idea as to what the problem is? I am currently working under the assumption that it is a compiler bug.

If it is a compiler bug, are there any workarounds to get the code to compile? For instance, is there a way to tell the compiler to use memory instead of registers for most variables?

I’ve done a Google search and haven’t been able to come up with anything at all. Any help would be greatly appreciated. If desired, I can be more specific and provide actual code (although it is very long).

Hi, I experienced the same problem.

I had many for loops inside the code, and I could solve it by dividing the code over several threads. Perhaps you have a for loop which you can split?

Does using __constant__ help?


This is not about code size. I recall a thread on this board where someone said that the (binary after optimization) max code size is 2MB. You would be hard pressed to reach that limit.

Your problem is register pressure. You have to help the compiler by using fewer variables that are stored in registers. That is, try to keep the working set of variables needed inside a loop small. If you can’t, move variables to shared memory.
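To make the shared-memory suggestion concrete, here is a minimal sketch of a kernel that keeps its per-thread temporaries in a __shared__ array instead of in local variables. Everything in it (scratch_kernel, run_scratch, THREADS, NTEMPS, the arithmetic) is made up for illustration and is not from the original poster's code. Whether this actually lowers the register count depends on the compiler, so treat it as something to try rather than a guaranteed fix.

```cuda
#include <cassert>
#include <cstdlib>
#include <cuda_runtime.h>

#define THREADS 64   // hypothetical block size
#define NTEMPS  8    // hypothetical number of per-thread temporaries

// Each thread stores its temporaries in its own column of a shared array
// rather than in NTEMPS separate local variables.
__global__ void scratch_kernel(const float *in, float *out, int n)
{
    // Layout [temp][thread]: neighbouring threads touch neighbouring words.
    __shared__ float temps[NTEMPS][THREADS];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    for (int k = 0; k < NTEMPS; ++k)
        temps[k][threadIdx.x] = in[i] * (k + 1);

    // A thread only reads back its own column, so no __syncthreads() is needed.
    float sum = 0.0f;
    for (int k = 0; k < NTEMPS; ++k)
        sum += temps[k][threadIdx.x];
    out[i] = sum;
}

// Hypothetical host helper: fills the input with `value`, runs the kernel,
// and returns out[0].
float run_scratch(float value, int n)
{
    assert(n > 0);
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = value;

    float *d_in = 0, *d_out = 0;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice);

    int blocks = (n + THREADS - 1) / THREADS;
    scratch_kernel<<<blocks, THREADS>>>(d_in, d_out, n);
    cudaMemcpy(h, d_out, bytes, cudaMemcpyDeviceToHost);

    float result = h[0];
    cudaFree(d_in);
    cudaFree(d_out);
    free(h);
    return result;
}
```

Calling run_scratch(1.0f, 100) from main() exercises the kernel. The trade-off is that shared memory is a per-block resource, so a large NTEMPS or block size limits how many blocks can be resident at once.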


Thank you both very much for your time and your suggestions.

After reducing the number of variables used, I was able to execute more code! However, it is not enough and I am still unable to execute as much code as I would like.

I also tried turning all of the local variables into global variables and declaring them __shared__, but it did not seem to help at all. I would try __constant__, but hardly any of my variables are constant.

Originally I was going to divide the code up into several separate __global__ functions, but that strategy was not very well suited to my application and I eventually gave up.

It seems strange to me that calling the same function with the same parameters several times would increase the register pressure more than calling the function once. This is what leads me to believe it is a compiler bug of some sort.

Any more suggestions or possible workarounds would again be greatly appreciated!

I suppose the function you are talking about is a __device__ function, right? In that case the register spill is expected, as __device__ functions are always inlined (see manual). In practice this means that the compiler has to keep the variables of the calling routine around and provide new registers for the inlined code. After inlining, it is a pretty tough job for the compiler to figure out which registers can be shared between the calling code and the inlined code. Compilers for the CPU, in contrast, have an easy job here, as the CPU has a stack.
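A minimal sketch of two knobs that bear on the inlining point above, assuming a toolchain that supports them: the __noinline__ qualifier, which asks the compiler not to inline a __device__ function, and the nvcc options shown in the comments. The function bodies and the host helper run_myglobalfunc are invented for illustration (the original mydevfunc takes no arguments and its body was not posted), and nothing in this thread confirms that the compiler version discussed here honors __noinline__.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Build hints (real nvcc options, values are only examples):
//   nvcc -Xptxas -v file.cu        prints register usage per kernel
//   nvcc -maxrregcount=32 file.cu  caps registers per thread; the excess
//                                  spills to (slow) local memory

// Hypothetical stand-in for the large device function from the thread.
// __noinline__ is a request, not a guarantee, on some compilers and targets.
__device__ __noinline__ float mydevfunc(float x)
{
    float acc = x;
    for (int k = 0; k < 4; ++k)
        acc = acc * 0.5f + 1.0f;
    return acc;
}

// Repeated calls: if inlined, each call site gets its own copy of the body.
__global__ void myglobalfunc(float *out)
{
    float v = 0.0f;
    v = mydevfunc(v);
    v = mydevfunc(v);
    v = mydevfunc(v);
    out[0] = v;
}

// Hypothetical host helper: launches a single thread, as in the original
// post, and returns the value the kernel wrote.
float run_myglobalfunc(void)
{
    float *d_out = 0;
    cudaError_t err = cudaMalloc((void **)&d_out, sizeof(float));
    assert(err == cudaSuccess);

    dim3 grid(1, 1);
    dim3 threads(1, 1, 1);
    myglobalfunc<<<grid, threads>>>(d_out);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return result;
}
```

Comparing the -Xptxas -v output with and without __noinline__ (and with one call versus three) is a cheap way to see whether inlining is what drives the register count up.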


My approach to dealing with register pressure has been to break the computation into a series of kernels, with global memory holding intermediate results. This can reduce the number of registers each intermediate kernel needs, which will improve processor utilization and performance.
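A minimal sketch of that multi-kernel approach, with made-up names and arithmetic (stage1, stage2, run_pipeline are all hypothetical): one computation is split into two small kernels, and the intermediate result lives in a global-memory buffer between launches. Kernels launched into the same stream run in order, so stage2 sees everything stage1 wrote.

```cuda
#include <cassert>
#include <cstdlib>
#include <cuda_runtime.h>

// Stage 1: first half of the computation, result parked in global memory.
__global__ void stage1(const float *in, float *tmp, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = in[i] * in[i];
}

// Stage 2: second half, reading the intermediate back from global memory.
__global__ void stage2(const float *tmp, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[i] + 1.0f;
}

// Hypothetical host helper: runs both stages on n copies of `value`
// and returns out[0].
float run_pipeline(float value, int n)
{
    assert(n > 0);
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = value;

    float *d_in = 0, *d_tmp = 0, *d_out = 0;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_tmp, bytes);   // intermediate results
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h, bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    stage1<<<blocks, threads>>>(d_in, d_tmp, n);
    stage2<<<blocks, threads>>>(d_tmp, d_out, n);  // same stream: runs after stage1

    cudaMemcpy(h, d_out, bytes, cudaMemcpyDeviceToHost);
    float result = h[0];
    cudaFree(d_in);
    cudaFree(d_tmp);
    cudaFree(d_out);
    free(h);
    return result;
}
```

Each kernel only needs registers for its own stage, at the cost of an extra round trip through global memory. Shared memory does not persist between launches, so anything a later stage needs has to go through the global buffer.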

We’d like to get a test case which reproduces this problem. If you could give us a *.cpp3.i file (generated with nvcc -keep), then we could reproduce this failure. It would be preferable if you could send the largest case.


I suspect I am hitting the same issue. For loops calling __device__ functions create problems. In my case, though, my computer just hangs. It works fine in emudebug, with no warnings from the compiler. It seems tricky to debug (at least for me :) ). The additional problem I have is that I need to access shared memory, so splitting it into different kernels does not work.