NVCC compiling question: where is the lmem?

wuwenji · March 2, 2011, 5:31pm

I have a CUDA application. I first compiled it with the following command:

nvcc -arch=sm_20 -maxrregcount 60 -Xptxas -v -I. -I…/…/…/src/include -c -O -o hash-table-gpu.o hash-table-gpu.cu

which allows a maximum of 60 registers for the kernels. Here is the output:
… …
ptxas info : Compiling entry function ‘_Z19compute_site_medianP4site’ for ‘sm_20’
ptxas info : Used 10 registers, 40 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 8 bytes cmem[16]
ptxas info : Compiling entry function ‘_Z19compute_host_medianP4site’ for ‘sm_20’
ptxas info : Used 10 registers, 40 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 4 bytes cmem[16]
ptxas info : Compiling entry function ‘_Z11flow_kernelP18rwGenericRec_V5_stlP4site5Table’ for ‘sm_20’
ptxas info : Used 47 registers, 4800+0 bytes smem, 88 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 12 bytes cmem[16]
ptxas info : Compiling entry function ‘_Z11build_tableP4site5Table’ for ‘sm_20’
ptxas info : Used 16 registers, 72 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14]
… …

The largest register count among these kernels is 47 (flow_kernel).

Then I limited the maximum number of registers per kernel to 20 with the command:

nvcc -arch=sm_20 -maxrregcount 20 -Xptxas -v -I. -I…/…/…/src/include -c -O -o hash-table-gpu.o hash-table-gpu.cu

Here is the compile output:
… … …
ptxas info : Compiling entry function ‘_Z19compute_site_medianP4site’ for ‘sm_20’
ptxas info : Used 10 registers, 40 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 8 bytes cmem[16]
ptxas info : Compiling entry function ‘_Z19compute_host_medianP4site’ for ‘sm_20’
ptxas info : Used 10 registers, 40 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 4 bytes cmem[16]
ptxas info : Compiling entry function ‘_Z11flow_kernelP18rwGenericRec_V5_stlP4site5Table’ for ‘sm_20’
ptxas info : Used 20 registers, 4800+0 bytes smem, 88 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 12 bytes cmem[16]
ptxas info : Compiling entry function ‘_Z11build_tableP4site5Table’ for ‘sm_20’
ptxas info : Used 16 registers, 72 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14]
… … …

Since the maximum number of registers was now limited to 20, I expected registers to spill to local memory, but the compiler output above shows no “lmem”. I also profiled the CUDA application with the Compute Visual Profiler and did see operations on local memory. Why does no “lmem” appear in the compiler output? Could anybody help me out? Thanks.

My GPU is a Fermi 2050, the OS is Linux, and the NVCC version is 3.2.

thanks,

wwu

tera · March 2, 2011, 5:34pm

There are other options to reduce register pressure, for example less aggressive reordering of loads, reloading values from variables in shared or global memory, or re-evaluating common subexpressions. These are often less costly than register spilling, so the compiler will likely try them first.
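To make the "reloading values from memory" option concrete, here is a minimal source-level sketch of the trade-off. It is only an illustration: `Params`, `sum_keep_live`, and `sum_reload` are hypothetical names, not taken from hash-table-gpu.cu, and ptxas makes these choices on its own at the instruction level, so writing the source one way or the other does not guarantee a particular register count. The macro guard is there only so the same file also builds with a plain host compiler.

```cpp
// Let the same source build without nvcc (the qualifiers become no-ops).
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical parameter block living in global (or shared) memory.
struct Params {
    int offset;
    int scale;
    int bias;
};

// Variant A: load the three fields once and keep them in registers.
// Three extra values stay live across the whole loop.
__host__ __device__ inline int sum_keep_live(const Params* p, const int* x, int n)
{
    const int offset = p->offset;
    const int scale  = p->scale;
    const int bias   = p->bias;
    int acc = 0;
    for (int i = 0; i < n; ++i)
        acc += (x[i] + offset) * scale + bias;
    return acc;
}

// Variant B: keep only the pointer live and re-read each field when needed.
// Fewer long-lived values, at the price of extra loads per iteration.
__host__ __device__ inline int sum_reload(const Params* p, const int* x, int n)
{
    int acc = 0;
    for (int i = 0; i < n; ++i)
        acc += (x[i] + p->offset) * p->scale + p->bias;
    return acc;
}
```

Both variants compute the same result. The second shows how a lower register count can be reached with ordinary loads rather than with spill stores to local memory, which is one way a kernel can fit under -maxrregcount without any lmem being reported.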

wuwenji · March 2, 2011, 5:48pm

Thanks,

But as I said, I profiled the application with the Compute Visual Profiler and saw operations on local memory. Why are there still local memory operations if local memory is not used?

thanks

wwu
