When I run an experiment on a GTX570 with CUDA, the block scheduler makes a strange decision.
GTX570 has 15 SMs.
I launch 15 thread blocks, each with 256 threads.
When I profile the kernel with the Visual Profiler, the kernel latency graph shows that SM0 was used twice, SM6 was never used, and every other SM was used once.
The graph looks like:
Multiprocessor SM0  SM1 SM2 SM3 SM4 SM5 SM6 SM7 SM8 SM9 SM10 SM11 SM12 SM13 SM14
Utilization    100% 50% 50% 50% 50% 50% 0%  50% 50% 50% 50%  50%  50%  50%  50%
I don't understand why the scheduler skips SM6 when there are exactly 15 blocks.
Also, when I run 13 blocks, SM6 and SM14 are not scheduled, and when I run 14 blocks, only SM6 is not scheduled.
However, when I run more than 15 blocks, every SM works on at least one block.
So my question is: why can't 15 blocks be scheduled across all 15 SMs? This behavior doubles the kernel's running time.
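For reference, here is a minimal sketch of how I can observe which SM each block lands on, by reading the PTX `%smid` special register from one thread per block (kernel and variable names are just illustrative, not from my actual kernel):

```cuda
#include <cstdio>

// Read the id of the SM the calling thread is running on (PTX %smid register).
__device__ unsigned int get_smid() {
    unsigned int id;
    asm("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Each block writes the id of the SM it was scheduled on.
__global__ void recordSM(unsigned int *out) {
    if (threadIdx.x == 0)              // one record per block is enough
        out[blockIdx.x] = get_smid();
}

int main() {
    const int blocks = 15, threads = 256;
    unsigned int *d_out, h_out[blocks];
    cudaMalloc(&d_out, blocks * sizeof(unsigned int));

    recordSM<<<blocks, threads>>>(d_out);
    cudaMemcpy(h_out, d_out, blocks * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    for (int b = 0; b < blocks; ++b)
        printf("block %2d -> SM %u\n", b, h_out[b]);

    cudaFree(d_out);
    return 0;
}
```

On my runs the mapping this prints matches the profiler graph above: two blocks report the same SM id while one SM id never appears.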