NVCC compiling question: where is the lmem?

I have a CUDA application. I first compiled it with the following command:

nvcc -arch=sm_20 -maxrregcount 60 -Xptxas -v -I. -I…/…/…/src/include -c -O -o hash-table-gpu.o hash-table-gpu.cu

which allows a maximum of 60 registers for the kernels. Here is the output:
… …
ptxas info : Compiling entry function '_Z19compute_site_medianP4site' for 'sm_20'
ptxas info : Used 10 registers, 40 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 8 bytes cmem[16]
ptxas info : Compiling entry function '_Z19compute_host_medianP4site' for 'sm_20'
ptxas info : Used 10 registers, 40 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 4 bytes cmem[16]
ptxas info : Compiling entry function '_Z11flow_kernelP18rwGenericRec_V5_stlP4site5Table' for 'sm_20'
ptxas info : Used 47 registers, 4800+0 bytes smem, 88 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 12 bytes cmem[16]
ptxas info : Compiling entry function '_Z11build_tableP4site5Table' for 'sm_20'
ptxas info : Used 16 registers, 72 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14]
… …

The largest register count among the kernels is 47 (flow_kernel).

Then I limited the maximum number of registers per kernel to 20 with the command:

nvcc -arch=sm_20 -maxrregcount 20 -Xptxas -v -I. -I…/…/…/src/include -c -O -o hash-table-gpu.o hash-table-gpu.cu

Here is the compile output:
… … …
ptxas info : Compiling entry function '_Z19compute_site_medianP4site' for 'sm_20'
ptxas info : Used 10 registers, 40 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 8 bytes cmem[16]
ptxas info : Compiling entry function '_Z19compute_host_medianP4site' for 'sm_20'
ptxas info : Used 10 registers, 40 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 4 bytes cmem[16]
ptxas info : Compiling entry function '_Z11flow_kernelP18rwGenericRec_V5_stlP4site5Table' for 'sm_20'
ptxas info : Used 20 registers, 4800+0 bytes smem, 88 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 12 bytes cmem[16]
ptxas info : Compiling entry function '_Z11build_tableP4site5Table' for 'sm_20'
ptxas info : Used 16 registers, 72 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14]
… … …

With the limit now at 20 registers, flow_kernel (which used 47 registers before) should spill registers to local memory. But I do not see any “lmem” entry in the compiler output above. I also profiled the CUDA application with the Compute Visual Profiler and did see operations on local memory. So why is there no “lmem” in the compiler output? Could anybody help me out? Thanks.

My GPU is a Fermi 2050, the OS is Linux, and NVCC is version 3.2.

thanks,

wwu

The compiler has other options to reduce register pressure, e.g. less aggressive reordering of loads, reloading values from variables in shared or global memory, or re-evaluating common subexpressions. These are often less costly than register spilling, so the compiler will likely try them first.

Thanks,

But as I said, I profiled the application with the Compute Visual Profiler and saw operations on local memory. So why are there still operations on local memory if local memory is not used?

thanks

wwu
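
One way to cross-check this, offered as a minimal sketch rather than a definitive diagnosis: the CUDA runtime can report how much local memory a compiled kernel reserves per thread through cudaFuncGetAttributes, independently of what ptxas prints. In the sketch below, example_kernel and uses_local_memory are hypothetical names; example_kernel only stands in for the real kernel (flow_kernel in the output above), and the sketch assumes the query is compiled in the same .cu file as the kernel. Keep in mind that local memory is not used only for register spills; per-thread arrays that the compiler cannot keep in registers are also placed there, which may be another reason for the profiler to report local memory traffic.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; replace with the kernel under investigation.
__global__ void example_kernel(float *data)
{
    data[threadIdx.x] *= 2.0f;
}

// Returns true if the compiled kernel reserves any per-thread local memory.
bool uses_local_memory(const cudaFuncAttributes &attr)
{
    return attr.localSizeBytes > 0;
}

int main()
{
    cudaFuncAttributes attr;

    // Ask the runtime for the resource usage of the compiled kernel.
    cudaError_t err = cudaFuncGetAttributes(&attr, example_kernel);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    std::printf("registers per thread    : %d\n", attr.numRegs);
    std::printf("local memory per thread : %lu bytes\n",
                (unsigned long)attr.localSizeBytes);
    std::printf("static shared memory    : %lu bytes\n",
                (unsigned long)attr.sharedSizeBytes);
    std::printf("uses local memory       : %s\n",
                uses_local_memory(attr) ? "yes" : "no");
    return 0;
}
```

Building this with each of the two -maxrregcount values used above and comparing the reported local memory size would indicate whether the limit of 20 actually introduces local memory for that kernel.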