Register Limit? Compilation to .cubin using local memory

Is there a limit on the number of registers per thread? I’ve made sure my code doesn’t use local memory: my ptx output has no .local variables, but the cubin is reporting local memory usage. It also seems that the compiler refuses to allow more than 60 registers. The variables being moved into local memory are not arrays either; they are just normal int variables.

I’ve looked around and the manual says 128 and people on here have said around 300-400…

Is there any way to ensure the compiler adheres to the ptx code without ‘optimising’ variables into local memory?

The limit is 128. To remind the compiler of this fact, pass “-maxrregcount=128” to it.

Maybe decuda can shed some light on this. Its output shows which variables are in local memory in the cubin (i.e. the ones that were put there by ptxas).
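If you would rather check from the program itself than read a disassembly, something like the sketch below might help. It assumes a toolkit recent enough to provide cudaFuncGetAttributes. The kernel `scale` and the helpers `uses_local_memory` and `report_scale_usage` are made-up names for illustration, not anything from the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, here only so there is something to inspect.
__global__ void scale(int *data, int factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

// True if any per-thread local memory was reserved for the kernel.
bool uses_local_memory(const cudaFuncAttributes &attr)
{
    return attr.localSizeBytes > 0;
}

// Query the resources allocated to `scale` and print them.
// Returns the CUDA error code so the caller can decide how to react.
cudaError_t report_scale_usage()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scale);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return err;
    }
    printf("registers per thread : %d\n", attr.numRegs);
    printf("local memory (bytes) : %zu\n", attr.localSizeBytes);
    printf("shared memory (bytes): %zu\n", attr.sharedSizeBytes);
    printf("uses local memory    : %s\n", uses_local_memory(attr) ? "yes" : "no");
    return cudaSuccess;
}
```

Call `report_scale_usage()` from `main` and build with something like `nvcc -maxrregcount=128 --ptxas-options=-v kernel.cu` (the file name is hypothetical). The `-v` option makes ptxas print its own per-kernel register and memory figures at compile time, so you can compare them with what the program reports.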

It might be things like blockDim & gridDim. According to the nvcc documentation they are in local memory (it’s on one of the last pages).

blockDim and gridDim are in shared memory.

Yep, I remembered wrongly: according to the doc, it is the index information that is in local memory. I would guess that blockIdx is also in shared memory, as it is the same for all threads in a block, and I would expect the threadIdx values to be in registers. So I personally am guessing the documentation is wrong, but who knows.

I would really like to have some confirmation from an NVIDIA guy as to what the reality is: how it is described in the doc, or how everybody has been thinking it is?

That doesn’t sound right. The 2nd smem/lmem number is usually as large as the 1st.