Error compiling kernel function

This is the error that I am getting:

"/tmp/tmpxft_000016d7_00000000-5.i": Warning: Olimit was exceeded on function _Z25estimate_kernel_optimisedPfS_S_S_S_S_S_S_S_S_S_S_PiS_S_S_f; will not perform function-scope optimization.

To still perform function-scope optimization, use -OPT:Olimit=0 (no limit) or -OPT:Olimit=22672

Assertion failure at line 2385 of …/…/be/cg/NVISA/cgtarget.cxx:

Compiler Error in file /tmp/tmpxft_000016d7_00000000-5.i during Register Allocation phase:

ran out of registers in float

*** glibc detected *** /usr/local/cuda/open64/lib//be: free(): invalid pointer: 0x00000000014d1cc0 ***

======= Backtrace: =========

/lib64/libc.so.6[0x3d3b272832]

/lib64/libc.so.6(cfree+0x8c)[0x3d3b275f2c]

/usr/local/cuda/open64/lib//be[0x573d30]

/usr/local/cuda/open64/lib//be[0x597b88]

/lib64/libc.so.6(exit+0x109)[0x3d3b234029]

/usr/local/cuda/open64/lib//be[0x6881da]

/usr/local/cuda/open64/lib//be[0x514a3f]

/usr/local/cuda/open64/lib//be[0x514bed]

/usr/local/cuda/open64/lib//be[0x525920]

/usr/local/cuda/open64/lib//be[0x419bcd]

/usr/local/cuda/open64/lib//be[0x419fbd]

/usr/local/cuda/open64/lib//be[0x41b1f3]

/lib64/libc.so.6(__libc_start_main+0xf4)[0x3d3b21e074]

/usr/local/cuda/open64/lib//be[0x417659]

======= Memory map: ========

00400000-0080d000 r-xp 00000000 fd:05 393234 /usr/local/cuda/open64/lib/be

00a0c000-00c5e000 rw-p 0040c000 fd:05 393234 /usr/local/cuda/open64/lib/be

00c5e000-0315d000 rw-p 00c5e000 00:00 0 [heap]

3d3a000000-3d3a01b000 r-xp 00000000 fd:05 35029046 /lib64/ld-2.7.so

3d3a21a000-3d3a21b000 r--p 0001a000 fd:05 35029046 /lib64/ld-2.7.so

3d3a21b000-3d3a21c000 rw-p 0001b000 fd:05 35029046 /lib64/ld-2.7.so

3d3b200000-3d3b34d000 r-xp 00000000 fd:05 35029048 /lib64/libc-2.7.so

3d3b34d000-3d3b54d000 ---p 0014d000 fd:05 35029048 /lib64/libc-2.7.so

3d3b54d000-3d3b551000 r--p 0014d000 fd:05 35029048 /lib64/libc-2.7.so

3d3b551000-3d3b552000 rw-p 00151000 fd:05 35029048 /lib64/libc-2.7.so

3d3b552000-3d3b557000 rw-p 3d3b552000 00:00 0

3d3b600000-3d3b682000 r-xp 00000000 fd:05 35029071 /lib64/libm-2.7.so

3d3b682000-3d3b881000 ---p 00082000 fd:05 35029071 /lib64/libm-2.7.so

3d3b881000-3d3b882000 r--p 00081000 fd:05 35029071 /lib64/libm-2.7.so

3d3b882000-3d3b883000 rw-p 00082000 fd:05 35029071 /lib64/libm-2.7.so

2aaaaaaab000-2aaaaaaad000 rw-p 2aaaaaaab000 00:00 0

2aaaaaaad000-2aaaaab7a000 r-xp 00000000 fd:05 1803203 /usr/local/matlab/sys/os/glnxa64/libstdc++.so.6.0.8

2aaaaab7a000-2aaaaac7a000 ---p 000cd000 fd:05 1803203 /usr/local/matlab/sys/os/glnxa64/libstdc++.so.6.0.8

2aaaaac7a000-2aaaaac9b000 rw-p 000cd000 fd:05 1803203 /usr/local/matlab/sys/os/glnxa64/libstdc++.so.6.0.8

2aaaaac9b000-2aaaaacae000 rw-p 2aaaaac9b000 00:00 0

This happened while manually unrolling a loop. It looks like some toolchain error is triggered afterwards. What will change in the generated code if I add -OPT:Olimit=0 to my command line?

Hmm, I had

for (int k = tid; k < 2048; k += 512)
{
    sdata[k] += input1[k] * input2[k] + input1[k + 256] * input2[k + 256];
    mdata[k] = fminf(mdata[k], fminf(input1[k], input1[k + 256]));
}

I turned it into one statement and got the error.

Then I changed the code to:

sdata[k] += input1[k] * input2[k] + input1[k + 256] * input2[k + 256];
sdata[k] += input1[k + 512] * input2[k + 512] + input1[k + 768] * input2[k + 768];
sdata[k] += input1[k + 1024] * input2[k + 1024] + input1[k + 1280] * input2[k + 1280];
sdata[k] += input1[k + 1536] * input2[k + 1536] + input1[k + 1792] * input2[k + 1792];

(with equivalent statements for the fminf)

And I still get the error. Now I am really puzzled…

I've seen a similar error with CUDA 1.0 and a really register-hungry kernel. It was triggered because nvcc ran out of virtual registers. Supplying Olimit had no effect; upgrading to 1.1 helped.

You're already using 1.1, I guess… So you can try Olimit; it will probably increase some internal limit on virtual registers. Or you need to undo the loop unrolling to decrease register usage.

I was only using 27 registers before the unrolling attempt (which I only did because I was getting wrong results; using macros is fun, but doesn't always lead to working code ;))

If you can attach a full test app which reproduces this (with build instructions), I can investigate further.

I was thinking of putting this kernel into a test app like the ones in the SDK (for debugging the original unrolled code). When I am done, I will mail it to you.

The number of registers used by the unrolled kernel may well still be 27; the problem is not the number of actual registers but the number of virtual registers. For each assignment in the PTX a new virtual register is allocated, so for a heavily unrolled kernel this can easily go into the tens of thousands.
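To illustrate: PTX is written in an SSA-like form where every result lands in a fresh virtual register. A fragment of what nvcc emits looks roughly like this (illustrative only, not actual output from this kernel; register numbers are made up):

```
ld.global.f32  %f101, [%rd12];      // each load gets a new %f register
ld.global.f32  %f102, [%rd13];
mul.f32        %f103, %f101, %f102; // ...and so does each product
add.f32        %f104, %f100, %f103; // ...and each update of the running sum
```

It is only later, during register allocation, that these thousands of virtual registers are mapped onto the small physical register file, which is the phase that crashed in the log above.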

I’ve only had this error in really crazy cases though in 1.1.

How do I give the no-limit option to nvcc? I tried adding -OPT:Olimit=0, but I got an error about redefinition of argument 'optimize'.

Then I tried changing -O3 into -OPT:Olimit=0, but got: nvcc fatal : 'PT:Olimit=0': expected a number

The nvcc documentation mentions "limit" only once, in connection with maxregcount.

[quote name='DenisR' date='Jan 30 2008, 04:50 PM']
How do I give the no-limit option to nvcc?
[/quote]

--opencc-options -OPT:Olimit=99999

or to get the tools more talkative

--opencc-options -v,-OPT:Olimit=99999

Though it did not help me to overcome the famous "ran out of registers" bug :(

I observed one thing regarding this: initially my code used double precision (all kernel computations were in double) and I did not run into this issue.
Once I switched to single precision (floats instead of doubles) in my kernel, I ran into it. Does that imply that the compiler omits certain optimizations for double-precision arithmetic? I can see why the compiler might want to leave double-precision arithmetic untouched, but I just want to confirm this.

Note: my code heavily uses registers and local memory in the double-precision version:
110 registers (without any limit) and 2 KB of local memory.