kernel function size limit? how many lines or variables are allowed?

Hello, I’ve run into a problem that looks like running out of registers (maybe).
I tried to write a kernel function of more than 1000 lines, with over one hundred variables in it, and then NVCC reported “Olimit was exceeded on function my_func; will not perform function-scope optimization” … “ran out of registers in predicate” …
By the way, I didn’t use any type qualifiers on any variable. Does that mean all the variables are stored in registers?

Is there any specification of how long a kernel function can be?


The maximum kernel size is around 2 million hardware instructions (according to the CUDA Programming Guide). However, you’ll run out of registers much earlier, which is exactly what happened to you.

You should try to redesign your kernel so that it uses fewer registers (the compiler always tries to use registers where possible). This may be done by using shared or local (slow!) memory, or by splitting the big kernel into several smaller ones.
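
As a rough illustration of the "split the big kernel" suggestion, here is a minimal sketch. The kernels stage1 and stage2, the helper run_split_kernels, and the arithmetic inside them are all hypothetical stand-ins and are not taken from the original code. The intermediate result goes through a global-memory buffer, so each kernel only has to keep its own half of the variables live.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Hypothetical first half of the original large kernel.
__global__ void stage1(const float* in, float* tmp, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        tmp[i] = x * x + 1.0f;   // placeholder for the first half of the work
    }
}

// Hypothetical second half; it reads the intermediate from global memory.
__global__ void stage2(const float* tmp, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = 2.0f * tmp[i] - 3.0f;   // placeholder for the second half
    }
}

// Runs both stages and compares against a CPU reference (assumes n > 0).
// Returns true only if every checked CUDA call succeeds and all results match.
bool run_split_kernels(int n)
{
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = 0.5f * i;

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *d_in = nullptr, *d_tmp = nullptr, *d_out = nullptr;
    bool ok = cudaMalloc(&d_in, bytes) == cudaSuccess
           && cudaMalloc(&d_tmp, bytes) == cudaSuccess
           && cudaMalloc(&d_out, bytes) == cudaSuccess
           && cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        stage1<<<blocks, threads>>>(d_in, d_tmp, n);
        // Same (default) stream, so stage2 starts only after stage1 has finished.
        stage2<<<blocks, threads>>>(d_tmp, d_out, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // Check the GPU result against the same arithmetic on the CPU.
    for (int i = 0; ok && i < n; ++i) {
        float x = h_in[i];
        float expected = 2.0f * (x * x + 1.0f) - 3.0f;
        ok = std::fabs(h_out[i] - expected) <= 1e-3f * std::fabs(expected) + 1e-3f;
    }

    cudaFree(d_in);
    cudaFree(d_tmp);
    cudaFree(d_out);
    return ok;
}
```

Calling run_split_kernels(1024) from main is enough to try it. To see how many registers each kernel actually uses, compiling with nvcc --ptxas-options=-v prints per-kernel resource usage, and -maxrregcount=N caps the register count (at the cost of spilling to local memory). The extra global-memory round trip is the price of the split, so whether it pays off depends on the kernel.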

There is a much lower kernel size limit imposed by ptxas, something around 32767. However, I don’t know whether this gets fixed in 1.1.

32767 instructions?

Yes, you can run out of ‘virtual registers’ very soon, as the generated PTX uses a new ‘virtual register’ for each assignment. This is a PTX limitation, not a CUDA one.

The real limit is said to be 2 MB of shader instructions, which is 262144 64-bit instruction words (2097152 bytes / 8 bytes per word), and sometimes two instructions can be stored in one 64-bit word.

Thanks AndreiB, it really was such fat code that I also guessed the GPU could not deal with it.

But I ran a test: the code has about 200 lines of variable declarations and 800 lines of calculation. I cut 400 of the 800 calculation lines, and then compilation passed, reporting only “compiler may run out of memory or run very slowly for large Olimit values” (it took 5 seconds). So I wonder: the pure calculation also consumes extra registers, right?

It’s not the GPU but the compiler. I currently have a similar problem, and it seems I’ll be translating my code to PTX by hand. Might this solve the problem?

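Yes, I’ve seen examples of this. You may try putting if( blockIdx.x < 0 ) { __syncthreads(); } somewhere, as this sometimes reduces the number of registers used by the compiler :) (blockIdx.x is unsigned, so the condition is never true and the barrier never actually executes.)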

The compiler aggressively optimizes out dead code. If an entire long kernel (that doesn’t use shared memory) only results in a single global memory write, commenting out that write will cause the compiler to optimize away the entire kernel and leave you with an empty one. After you commented out 400 lines of code, nvcc probably optimized away a lot of the variables used.
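
To make the dead-code point concrete, here is a small sketch. The kernels with_store and without_store and the helper regs_used are hypothetical and much shorter than a real 1000-line kernel. The two kernels are identical except that the second has its only global write commented out, so nothing it computes is observable and the compiler is free to drop the arithmetic. The helper uses cudaFuncGetAttributes to read back the register count the compiler settled on for each kernel.

```cuda
#include <cuda_runtime.h>

// All of the arithmetic feeds the single store at the end, so it must be kept.
__global__ void with_store(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float a = in[i];
    float b = a * 1.5f + 2.0f;
    float c = b * b - a;
    float d = c * 0.25f + b;
    out[i] = d;   // the only observable effect of the kernel
}

// Same body with the store commented out: nothing computed here is ever
// observed, so the compiler may remove the whole computation as dead code.
__global__ void without_store(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float a = in[i];
    float b = a * 1.5f + 2.0f;
    float c = b * b - a;
    float d = c * 0.25f + b;
    (void)d;          // silences the unused-variable warning only
    // out[i] = d;
}

// Returns the register count reported for a kernel, or -1 if the query fails.
template <typename Kernel>
int regs_used(Kernel kernel)
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, kernel) != cudaSuccess) return -1;
    return attr.numRegs;
}
```

Comparing regs_used(with_store) and regs_used(without_store), or looking at the output of nvcc --ptxas-options=-v, should show whether the second kernel was stripped down. The exact numbers depend on the compiler version and target architecture, so treat them as something to measure rather than assume.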