I’m trying to launch multiple blocks to a single SM.
Device=GTX 470 CC=2.0
As per the CUDA Occupancy Calculator, I can launch 6 blocks per SM (registers/thread = 5, shared memory = 22 B, threads/block = 256). But when I launch 6 blocks with 256 threads/block, I measure 355 GFLOPs. For the GTX 470 the peak MAD throughput per SM is ~70 GFLOPs, which suggests that other SMs are involved in the execution.
How do I control the distribution of blocks to SMs, and is my speculation right?
If the scheduler were putting all 6 blocks on the same SM, it would not be doing its job properly. (Under normal circumstances, you want blocks distributed over all the SMs; filling up one SM before moving to the next would tend to underutilize the device.) CUDA does not provide any simple interface to control block scheduling.
It might be possible to force the configuration you want by launching enough blocks to fill the entire device, then having each thread read the %smid register using inline PTX. A thread can then exit if %smid does not equal the ID of the target SM. There is the possibility of a race condition between your threads and the block scheduler, so this still might not work.
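A minimal sketch of that idea (the kernel name and the choice of SM 0 are illustrative; %smid is a documented PTX special register, but NVIDIA notes its value is not guaranteed to stay constant for the lifetime of a thread, which is part of why this trick is fragile):

```cuda
#include <cstdio>

// Read the SM this block is resident on via inline PTX.
__device__ unsigned int get_smid() {
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Hypothetical kernel: blocks that did not land on target_sm exit
// immediately. All threads of a block share one SM, so one check
// makes the whole block bail out together.
__global__ void run_on_one_sm(unsigned int target_sm) {
    if (get_smid() != target_sm)
        return;
    // ... real work for blocks resident on the target SM ...
}

int main() {
    // Launch far more blocks than can be resident at once so that
    // every SM (hopefully) receives some of them.
    run_on_one_sm<<<1024, 256>>>(0);
    cudaDeviceSynchronize();
    return 0;
}
```

Note that the blocks which exit early may be replaced by newly scheduled blocks on other SMs, so there is no guarantee the surviving blocks all end up where you want them.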