I am trying to use the CUDA_Occupancy_Calculator spreadsheet.
Its Help tab says the following:
ptxas info : Compiling entry function ‘_Z8my_kernelPf’ for ‘sm_10’
ptxas info : Used 5 registers, 8+16 bytes smem
Let’s say “my_kernel” contains an external shared memory array which is allocated to be 2048 bytes at run time. Then our total shared memory usage per block is 2048+8+16 = 2072 bytes.
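The arithmetic in that Help text can be sketched as a small host-side C++ helper. This is only an illustration: the function name `total_smem_per_block` and its parameter names are hypothetical, and it assumes, as the Help text describes, that the two numbers ptxas prints before "bytes smem" are simply added to the shared memory allocated at run time.

```cpp
#include <cstddef>

// Hypothetical helper (not part of CUDA or the Occupancy Calculator).
// smem_a and smem_b are the two numbers ptxas prints as "a+b bytes smem";
// dynamic_smem is the external shared memory array size chosen at run time.
constexpr std::size_t total_smem_per_block(std::size_t smem_a,
                                           std::size_t smem_b,
                                           std::size_t dynamic_smem)
{
    return smem_a + smem_b + dynamic_smem;
}

// The Help text's example: 8+16 bytes smem plus 2048 bytes at run time.
static_assert(total_smem_per_block(8, 16, 2048) == 2072,
              "matches the 2072 bytes given in the Help text");
```

Under the same assumption, a kernel with no external shared memory array would pass 0 as the third argument.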
For my own kernels, the compiler output is shown below.
ptxas info : Compiling entry function ‘_Z14float_to_colorP6uchar4PKf’ for ‘sm_10’
ptxas info : Used 8 registers, 16+16 bytes smem, 44 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z13PRINT_POLYGONPhPiiiii’ for ‘sm_10’
ptxas info : Used 16 registers, 32+16 bytes smem, 20 bytes cmem[1]
ptxas info : Compiling entry function ‘_Z14float_to_colorPhPKf’ for ‘sm_10’
ptxas info : Used 8 registers, 16+16 bytes smem, 44 bytes cmem[1]
PRINT_POLYGON is the name of my kernel.
Then, how do I get its shared memory usage? Is the total amount of shared memory 32+16 = 48 bytes?
Or do I also have to add the global memory?
I allocated device memory (VMEM) and launched the kernel as follows.
HANDLE_ERROR( cudaMalloc( (void **)&dev_IMAGE, sizeof(unsigned char)*512*512*3) );
PRINT_POLYGON<<<grid,block>>>( dev_IMAGE, dev_MEM, data->deviceID, 0, 1, 2);