I am attempting to implement persistent threads, as described in https://mediatech.aalto.fi/~samuli/publications/aila2009hpg_paper.pdf, in my CUDA application running on a GTX 570.
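To make the question concrete, below is a stripped-down sketch of the persistent-threads pattern I am trying to follow: a fixed pool of blocks stays resident on the GPU and keeps claiming work-item indices from a global queue counter with atomicAdd until the queue is exhausted. The kernel name, processWorkItem(), and the queue layout are placeholders for my actual workload, not code taken from the paper.

// Persistent-threads skeleton (sketch only). Each resident block repeatedly
// claims a batch of work items from a global counter and processes them.
__device__ unsigned int g_queueHead;              // index of the next unclaimed work item

__device__ void processWorkItem(unsigned int idx)
{
    // placeholder for the real per-item computation
}

__global__ void persistentKernel(unsigned int numWorkItems)
{
    __shared__ volatile unsigned int s_base;      // start of this block's current batch

    while (true)
    {
        // One thread per block claims blockDim.x items for the whole block.
        if (threadIdx.x == 0)
            s_base = atomicAdd(&g_queueHead, blockDim.x);
        __syncthreads();

        if (s_base >= numWorkItems)
            break;                                // queue drained: the whole block retires

        unsigned int idx = s_base + threadIdx.x;
        if (idx < numWorkItems)
            processWorkItem(idx);
        __syncthreads();                          // regroup before claiming the next batch
    }
}

This is the device information reported for the card: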
Major revision number: 2
Minor revision number: 0
Name: GeForce GTX 570
Total global memory: 1310272 kb
Total global memory: 1279 mb
Total shared memory per block: 49152
Total registers per block: 32768
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 1024
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 65535
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 1540000
Total constant memory: 65536
Texture alignment: 512
Concurrent copy and execution: Yes
Number of multiprocessors: 15
Kernel execution timeout: Yes
Compute capability: 2.0.
--ptxas-options=-v gives the following output for my kernel:
ptxas info : Used 46 registers, 64 bytes smem, 48 bytes cmem[0], 24 bytes cmem[16]
Plugging these numbers into NVIDIA's CUDA GPU Occupancy Calculator (compute capability 2.0, 128 threads per block, 46 registers per thread, 64 bytes of shared memory per block) gives 5 active thread blocks per multiprocessor; I assume register usage is the limiting resource, since 46 × 128 ≈ 5888 registers per block and 32768 / 5888 ≈ 5.5.
I'm confused about how to interpret this when deciding how many blocks to launch. Since the card has 15 SMs and the calculator reports 5 active thread blocks per multiprocessor, should I simply be launching 5 × 15 = 75 blocks with 128 threads per block? Any clarification would be much appreciated.
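For reference, this is roughly how I was planning to size the launch, using the calculator's 5-blocks-per-SM figure and querying the SM count at runtime instead of hard-coding 15. persistentKernel and g_queueHead refer to the sketch above, and the work-queue size is a placeholder:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // reports 15 multiprocessors on the GTX 570

    const int blocksPerSM     = 5;                // figure from the occupancy calculator
    const int threadsPerBlock = 128;
    const int numBlocks       = blocksPerSM * prop.multiProcessorCount;   // 5 * 15 = 75

    unsigned int numWorkItems = 1u << 20;         // placeholder size of my work queue

    // Reset the global queue head, then launch the fixed pool of persistent blocks.
    unsigned int zero = 0;
    cudaMemcpyToSymbol(g_queueHead, &zero, sizeof(zero));

    persistentKernel<<<numBlocks, threadsPerBlock>>>(numWorkItems);
    cudaDeviceSynchronize();

    std::printf("launched %d blocks of %d threads\n", numBlocks, threadsPerBlock);
    return 0;
}

The idea, as I understand the paper, is that the grid exactly fills the machine (blocks per SM from the occupancy calculator × number of SMs), and the while-loop inside the kernel replaces launching one block per work item.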