I have a CUDA application. I compile it with the following command:
nvcc -arch=sm_20 -maxrregcount 60 -Xptxas -v -I. -I…/…/…/src/include -c -O -o hash-table-gpu.o hash-table-gpu.cu
which allows a maximum of 60 registers per kernel. Here is the output:
…
ptxas info : Compiling entry function ‘_Z19compute_site_medianP4site’ for ‘sm_20’
ptxas info : Used 10 registers, 40 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 8 bytes cmem[16]
ptxas info : Compiling entry function ‘_Z19compute_host_medianP4site’ for ‘sm_20’
ptxas info : Used 10 registers, 40 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 4 bytes cmem[16]
ptxas info : Compiling entry function ‘_Z11flow_kernelP18rwGenericRec_V5_stlP4site5Table’ for ‘sm_20’
ptxas info : Used 47 registers, 4800+0 bytes smem, 88 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 12 bytes cmem[16]
ptxas info : Compiling entry function ‘_Z11build_tableP4site5Table’ for ‘sm_20’
ptxas info : Used 16 registers, 72 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14]
…
The largest register count among the kernels is 47 (flow_kernel).
Then I limit the maximum number of registers per kernel to 20 with the command:
nvcc -arch=sm_20 -maxrregcount 20 -Xptxas -v -I. -I…/…/…/src/include -c -O -o hash-table-gpu.o hash-table-gpu.cu
Here is the compile output:
…
ptxas info : Compiling entry function ‘_Z19compute_site_medianP4site’ for ‘sm_20’
ptxas info : Used 10 registers, 40 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 8 bytes cmem[16]
ptxas info : Compiling entry function ‘_Z19compute_host_medianP4site’ for ‘sm_20’
ptxas info : Used 10 registers, 40 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 4 bytes cmem[16]
ptxas info : Compiling entry function ‘_Z11flow_kernelP18rwGenericRec_V5_stlP4site5Table’ for ‘sm_20’
ptxas info : Used 20 registers, 4800+0 bytes smem, 88 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14], 12 bytes cmem[16]
ptxas info : Compiling entry function ‘_Z11build_tableP4site5Table’ for ‘sm_20’
ptxas info : Used 16 registers, 72 bytes cmem[0], 12 bytes cmem[2], 8 bytes cmem[14]
…
Since the maximum number of registers is now limited to 20, and flow_kernel needed 47 registers before, I would expect registers to be spilled to local memory. But I do not see any “lmem” entry in the compile output above. I also profiled the application with the Compute Visual Profiler and did see operations on local memory. Why does no “lmem” entry appear in the compile output? Could anybody help me out?
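One cross-check I could try is to ask the runtime what it recorded for the kernel, instead of relying only on the ptxas log. Below is a minimal sketch using cudaFuncGetAttributes, which fills a cudaFuncAttributes struct with fields such as numRegs and localSizeBytes. Note that demo_kernel is a hypothetical stand-in for flow_kernel, and I am only assuming that spilled registers would show up in localSizeBytes; I have not verified that on sm_20.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel. In the real application this would be
// flow_kernel from hash-table-gpu.cu, compiled with the same -maxrregcount.
__global__ void demo_kernel(float *out)
{
    out[threadIdx.x] = 2.0f * threadIdx.x;
}

int main()
{
    cudaFuncAttributes attr;

    // Query the per-kernel resource usage recorded in the compiled binary.
    cudaError_t err = cudaFuncGetAttributes(&attr, demo_kernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // numRegs is the register count per thread; localSizeBytes is the
    // local memory per thread (assumed here to include any spill space).
    printf("registers per thread      : %d\n", attr.numRegs);
    printf("local memory per thread   : %lu bytes\n",
           (unsigned long)attr.localSizeBytes);
    printf("static shared mem / block : %lu bytes\n",
           (unsigned long)attr.sharedSizeBytes);
    return 0;
}
```

If localSizeBytes came back non-zero for the build with -maxrregcount 20 and zero for the build with 60, that would at least agree with what the profiler shows, even though the ptxas log prints no “lmem” entry.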
My GPU is a Fermi 2050, the NVCC version is 3.2, and the system is Linux.
Thanks.