Launch Parameters for Persistent Threads on GTX 570

I am attempting to implement persistent threads as described in: https://mediatech.aalto.fi/~samuli/publications/aila2009hpg_paper.pdf
for my CUDA application running on a GTX 570.

Major revision number:         2
Minor revision number:         0
Name:                          GeForce GTX 570
Total global memory:           1310272 kb
Total global memory:           1279 mb
Total shared memory per block: 49152
Total registers per block:     32768
Warp size:                     32
Maximum memory pitch:          2147483647
Maximum threads per block:     1024
Maximum dimension 0 of block:  1024
Maximum dimension 1 of block:  1024
Maximum dimension 2 of block:  64
Maximum dimension 0 of grid:   65535
Maximum dimension 1 of grid:   65535
Maximum dimension 2 of grid:   65535
Clock rate:                    1540000
Total constant memory:         65536
Texture alignment:             512
Concurrent copy and execution: Yes
Number of multiprocessors:     15
Kernel execution timeout:      Yes
Compute capability: 2.0.

--ptxas-options=-v gives the following output for my kernel:

ptxas info : Used 46 registers, 64 bytes smem, 48 bytes cmem[0], 24 bytes cmem[16]

This is the result of using NVIDIA’s CUDA GPU Occupancy Calculator:

I’m confused about how to interpret this information with regard to how many blocks I should launch. Since I have 15 SMs and the number of active thread blocks per multiprocessor is 5, should I be launching 5 * 15 = 75 blocks with 128 threads per block? Any clarification would be much appreciated.

Persistent threads generally persist.
Hence, I have found the easiest test to be the debugger: it can show the number of thread blocks seated, and where.