Register Limit? Compilation to .cubin using local memory

trex · December 11, 2008, 1:15am

Is there a limit on the number of registers per thread? I’ve made sure my code doesn’t use local memory: my PTX output has no .local variables, but the cubin is reporting local memory usage. It also seems that the compiler refuses to allow more than 60 registers. The variables being moved into local memory are not arrays either; they are just normal int variables.

I’ve looked around and the manual says 128 and people on here have said around 300-400…

Is there any way to ensure the compiler adheres to the PTX code without ‘optimising’ variables into local memory?

alex_dubinsky · December 11, 2008, 3:59am

The limit is 128 registers per thread. To remind the compiler of this fact, pass “-maxrregcount=128” to it.

E.D_Riedijk · December 11, 2008, 7:10am

Maybe decuda can shed some light on this. There you can see which variables are in local memory in the cubin (i.e. they were put there by ptxas).

It might be things like blockDim & gridDim. According to the nvcc documentation they are in local memory (it’s on one of the last pages).

alex_dubinsky · December 11, 2008, 7:14am

blockDim and gridDim are in shared memory

E.D_Riedijk · December 11, 2008, 9:48am

Yep, I remembered wrongly: according to the doc, it is the index information that is in local memory. I would guess, though, that blockIdx is also in shared memory, as it is the same for all threads in a block, and I would expect the threadIdx values to be in registers. So I am personally guessing the documentation is wrong, but who knows. The doc says:

A summary on the amount of used registers and the amount of memory needed per compiled device function can be printed by passing option -v to ptxas:

nvcc -Xptxas -v acos.cu

ptxas info : Compiling entry function ‘acos_main’
ptxas info : Used 4 registers, 60+56 bytes lmem, 44+40 bytes smem,
                             20 bytes cmem[1], 12 bytes cmem[14]

As shown in the above example, the amounts of local and shared memory are listed by two numbers each. The first number represents the total size of all variables declared in local or shared memory, respectively. The second number represents the amount of system-allocated data in these memory segments: device function parameter block (in shared memory) and thread/grid index information (in local memory).

Used constant memory is partitioned in constant program ‘variables’ (bank 1), plus compiler generated constants (bank 14).
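To make the two-number format concrete, here is a minimal Python sketch that splits a ptxas summary line into its parts. The function name `parse_ptxas_info` and the regular expressions are illustrative assumptions based only on the sample line quoted above; other toolkit versions may print a different format. Labeling the second number as "system-allocated" follows the quoted documentation, which may or may not match what the hardware actually does.

```python
import re

# Assumed field shapes, taken from the sample line only:
#   "60+56 bytes lmem"  -> declared + system-allocated (per the quoted doc)
#   "20 bytes cmem[1]"  -> a single number, no system-allocated part
_FIELD = re.compile(r"(\d+)(?:\+(\d+))?\s+bytes\s+(\w+(?:\[\d+\])?)")
_REGS = re.compile(r"Used\s+(\d+)\s+registers")


def parse_ptxas_info(text):
    """Return the register count and (declared, system) bytes per segment."""
    usage = {}
    match = _REGS.search(text)
    if match:
        usage["registers"] = int(match.group(1))
    for declared, system, segment in _FIELD.findall(text):
        # A missing "+N" part is reported as 0 system-allocated bytes.
        usage[segment] = (int(declared), int(system) if system else 0)
    return usage


sample = (
    "ptxas info : Used 4 registers, 60+56 bytes lmem, 44+40 bytes smem,\n"
    "                             20 bytes cmem[1], 12 bytes cmem[14]"
)
usage = parse_ptxas_info(sample)
print(usage["lmem"])  # (60, 56): 60 bytes declared, 56 bytes in the second field
```

If the first lmem number is non-zero for a kernel whose PTX has no .local variables, that would suggest ptxas itself placed the data there, which is the situation described in the original question.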

I would really like to have some confirmation from an NVIDIA guy as to what the reality is. Is it how the doc describes it, or how everybody has been thinking it is?

alex_dubinsky · December 11, 2008, 7:13pm

That doesn’t sound right. The 2nd smem/lmem number is usually as large as the 1st.