Weird SM scheduling policy on GTX570

When I run an experiment on a GTX 570 with CUDA, the SM scheduler makes a strange decision.

The GTX 570 has 15 SMs.
I launch 15 thread blocks, each with 256 threads.
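The launch is essentially this (kernel name and argument are placeholders, not my real code):

    // One thread block per SM on the 15-SM GTX 570, 256 threads per block.
    const int numBlocks = 15;
    const int threadsPerBlock = 256;
    myKernel<<<numBlocks, threadsPerBlock>>>(d_data);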
When I profile the kernel with the Visual Profiler, the kernel latency graph shows that SM0 is used twice, SM6 is never used, and the other SMs are each used once.
The graph looks like:

Multiprocessor:  SM0   SM1  SM2  SM3  SM4  SM5  SM6  SM7  SM8  SM9  SM10  SM11  SM12  SM13  SM14
Utilization:     100%  50%  50%  50%  50%  50%  0%   50%  50%  50%  50%   50%   50%   50%   50%

I don't understand why it skips SM6 when there are 15 blocks.

Also, when I run 13 blocks, SM6 and SM14 are not scheduled.
When I run 14 blocks, only SM6 is not scheduled.
However, when I run more than 15 blocks, every SM works on at least one block.
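(A generic way to confirm the block-to-SM mapping, independent of the profiler, is to read the %smid special register inside the kernel. This is only a sketch; blockSm is a placeholder output buffer.)

    // Record which SM each block runs on by reading %smid via inline PTX.
    __device__ unsigned int get_smid()
    {
        unsigned int id;
        asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
        return id;
    }

    __global__ void recordSm(unsigned int *blockSm)
    {
        if (threadIdx.x == 0)
            blockSm[blockIdx.x] = get_smid();   // one entry per block
    }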

So my question is: why can't the 15 blocks be scheduled across all 15 SMs? This behavior doubles the kernel running time.

Please help!

Do these thread blocks have an extremely short execution time?

The compute work distributor algorithm is not documented and changes between architectures. The GTX 570 is not a perfect chip: it is missing one SM (the full GF110 die has 16), which causes oddities in the scheduling. If you want an optimal launch (one block per SM), you can force this behavior with dynamic shared memory allocation. For example, if you allocate 24*1024+1 bytes of shared memory, only one block can be resident on each SM at a time.
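A sketch of that approach (kernel and buffer names are made up, assuming the Fermi 48 KB shared-memory configuration):

    // Force at most one block per SM by requesting just over half of the
    // 48 KB of shared memory available per SM on Fermi (GTX 570).
    #include <cuda_runtime.h>

    __global__ void myKernel(float *data)
    {
        extern __shared__ char padding[];   // never touched; only its size matters
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        data[idx] += 1.0f;
    }

    int main()
    {
        const int numBlocks = 15, threadsPerBlock = 256;
        float *d_data;
        cudaMalloc(&d_data, numBlocks * threadsPerBlock * sizeof(float));

        // Keep the 48 KB shared / 16 KB L1 split so the request below fits.
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

        // 24*1024+1 bytes per block: two such blocks would need more than 48 KB,
        // so the work distributor can place only one block on each SM.
        size_t smemBytes = 24 * 1024 + 1;
        myKernel<<<numBlocks, threadsPerBlock, smemBytes>>>(d_data);
        cudaDeviceSynchronize();

        cudaFree(d_data);
        return 0;
    }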