I have a CUDA application that I first compiled with the following command:
nvcc -arch=sm_20 -maxrregcount 60 -Xptxas -v -I. -I…/…/…/src/include -c -O -o hash-table-gpu.o hash-table-gpu.cu
which allows a maximum of 60 registers for the kernels. Here is the output:
… …
ptxas info : Compiling entry function '_Z19compute_site_medianP4site' for 'sm_20'
ptxas info : Used 10 registers, 40 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 8 bytes cmem[16]
ptxas info : Compiling entry function '_Z19compute_host_medianP4site' for 'sm_20'
ptxas info : Used 10 registers, 40 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 4 bytes cmem[16]
ptxas info : Compiling entry function '_Z11flow_kernelP18rwGenericRec_V5_stlP4site5Table' for 'sm_20'
ptxas info : Used 47 registers, 4800+0 bytes smem, 88 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 12 bytes cmem[16]
ptxas info : Compiling entry function '_Z11build_tableP4site5Table' for 'sm_20'
ptxas info : Used 16 registers, 72 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14]
… …
The largest register count among the kernels is 47 (flow_kernel).
Then I limited the maximum number of registers per kernel to 20 with the command:
nvcc -arch=sm_20 -maxrregcount 20 -Xptxas -v -I. -I…/…/…/src/include -c -O -o hash-table-gpu.o hash-table-gpu.cu
Here is the compile output:
… … …
ptxas info : Compiling entry function '_Z19compute_site_medianP4site' for 'sm_20'
ptxas info : Used 10 registers, 40 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 8 bytes cmem[16]
ptxas info : Compiling entry function '_Z19compute_host_medianP4site' for 'sm_20'
ptxas info : Used 10 registers, 40 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 4 bytes cmem[16]
ptxas info : Compiling entry function '_Z11flow_kernelP18rwGenericRec_V5_stlP4site5Table' for 'sm_20'
ptxas info : Used 20 registers, 4800+0 bytes smem, 88 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 12 bytes cmem[16]
ptxas info : Compiling entry function '_Z11build_tableP4site5Table' for 'sm_20'
ptxas info : Used 16 registers, 72 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14]
… … …
Since the maximum number of registers is now limited to 20, flow_kernel, which used 47 registers before, should be spilling registers to local memory. However, the compile output above does not report any “lmem”. I also profiled the CUDA application with the Compute Visual Profiler and did see operations on local memory. So why does “lmem” not appear in the compile output? Could anybody help me out?
My GPU is a Fermi 2050, the OS is Linux, and the compiler is NVCC 3.2.
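One run-time cross-check that might help narrow this down is to ask the CUDA runtime how much local memory it thinks a kernel needs, using cudaFuncGetAttributes. The following is only a minimal sketch: dummy_kernel is a hypothetical stand-in and not part of my application, so the real flow_kernel would have to be substituted, and I am assuming that the localSizeBytes field reflects register spills on this toolkit version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel (not from the application above);
// substitute the real kernel, e.g. flow_kernel, when trying this.
__global__ void dummy_kernel(int *out)
{
    out[threadIdx.x] = threadIdx.x;
}

int main()
{
    cudaFuncAttributes attr;

    // Query the compiled kernel's resource usage as seen by the runtime.
    cudaError_t err = cudaFuncGetAttributes(&attr, dummy_kernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // numRegs should match the "Used N registers" line from ptxas;
    // localSizeBytes is the per-thread local memory the runtime reports.
    printf("registers per thread : %d\n", attr.numRegs);
    printf("local memory (bytes) : %lu\n", (unsigned long)attr.localSizeBytes);
    printf("shared memory (bytes): %lu\n", (unsigned long)attr.sharedSizeBytes);
    return 0;
}
```

If the reported local memory size is nonzero for the build with -maxrregcount 20 and zero for the build with 60, that would suggest the spilling does happen and is simply not printed by ptxas.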
Thanks,
wwu