I am attempting to implement persistent threads, as described in https://mediatech.aalto.fi/~samuli/publications/aila2009hpg_paper.pdf, for my CUDA application running on a GTX 570. The device properties are:
Major revision number: 2
Minor revision number: 0
Name: GeForce GTX 570
Total global memory: 1310272 kb
Total global memory: 1279 mb
Total shared memory per block: 49152
Total registers per block: 32768
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 1024
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 65535
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 1540000
Total constant memory: 65536
Texture alignment: 512
Concurrent copy and execution: Yes
Number of multiprocessors: 15
Kernel execution timeout: Yes
Compute capability: 2.0
Compiling with --ptxas-options=-v gives the following output for my kernel:
ptxas info : Used 46 registers, 64 bytes smem, 48 bytes cmem, 24 bytes cmem
This is the result of using NVIDIA’s CUDA GPU Occupancy Calculator:
I’m confused about how to interpret this information with regard to how many blocks I should launch. Since I have 15 SMs and the number of active thread blocks per multiprocessor is 5, should I be launching 5 * 15 = 75 blocks with 128 threads per block? Any clarification would be much appreciated.
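To make my reasoning concrete, here is a minimal sketch of the arithmetic I have in mind. It is host-side C++ only (it should also compile inside a .cu file). The names `SmLimits`, `blocks_per_sm` and `persistent_grid_size` are mine, not part of any CUDA API. The limits of 1536 resident threads and 8 resident blocks per SM are the compute capability 2.0 values as I understand them, and the estimate ignores allocation granularity, so it is only a rough cross-check of the occupancy calculator.

```cpp
// Host-side arithmetic only; no CUDA runtime calls are made here.
#include <algorithm>
#include <cassert>

// Per-SM resource limits (hypothetical helper struct, not a CUDA type).
struct SmLimits {
    int registers;     // 32-bit registers per SM (32768 in the device query)
    int shared_bytes;  // shared memory per SM in bytes (49152 in the device query)
    int max_threads;   // resident threads per SM (assumed 1536 for CC 2.0)
    int max_blocks;    // resident blocks per SM (assumed 8 for CC 2.0)
};

// Simplified estimate of resident blocks per SM. It ignores register and
// shared-memory allocation granularity, which the occupancy calculator
// does take into account.
int blocks_per_sm(const SmLimits& sm, int threads_per_block,
                  int regs_per_thread, int smem_per_block) {
    assert(threads_per_block > 0 && regs_per_thread > 0);
    int by_regs = sm.registers / (regs_per_thread * threads_per_block);
    int by_threads = sm.max_threads / threads_per_block;
    int by_smem = smem_per_block > 0 ? sm.shared_bytes / smem_per_block
                                     : sm.max_blocks;
    // The tightest limit decides how many blocks fit on one SM.
    return std::min({by_regs, by_threads, by_smem, sm.max_blocks});
}

// Persistent-threads grid: just enough blocks to fill every SM once.
int persistent_grid_size(int sm_count, int resident_blocks_per_sm) {
    return sm_count * resident_blocks_per_sm;
}
```

With my numbers (46 registers per thread, 64 bytes of shared memory, 128 threads per block), `blocks_per_sm` is limited by registers and returns 5, so `persistent_grid_size(15, 5)` is 75. If my reading is right, that is the block count I would pass to the kernel launch, with 128 threads per block.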