Why does adding one block double the kernel duration on a GTX570?

When I ran an experiment on a GTX570 with CUDA, the SM scheduler made a strange decision, and based on the results I have a question. The experiments are described below.

GTX570 has 15 SMs.

First experiment:
14 thread blocks, each with 256 threads.
When I profile the kernel with the Visual Profiler, the kernel latency graph shows the result below.

Multiprocessor SM0 SM1 SM2 SM3 SM4 SM5 SM6 SM7 SM8 SM9 SM10 SM11 SM12 SM13 SM14
Utilization___ 50% 50% 50% 50% 50% 50% 0% 50% 50% 50% 50% 50% 50% 50% 50%
(SM6 has no block; every other SM has one block.)


Second experiment:
15 thread blocks, each with 256 threads.
The graph looks like:

Multiprocessor SM0 SM1 SM2 SM3 SM4 SM5 SM6 SM7 SM8 SM9 SM10 SM11 SM12 SM13 SM14
Utilization___ 100% 50% 50% 50% 50% 50% 0% 50% 50% 50% 50% 50% 50% 50% 50%
(SM0 has been assigned two blocks, and SM6 still has none.)

The two experiments have very different execution times: the second takes at least twice as long as the first.

My question is: why does the second experiment take twice as long? Is it because SM0 received two blocks (16 warps), so its workload, and therefore the total time, doubled compared to the first experiment?

Yes — a kernel only finishes when its last block finishes, so the one SM that got two blocks determines the total runtime. But the basic idea with CUDA is that you create so many blocks that singular issues like this don't really matter at all.

If you really need to get the scheduling right for every block, you probably want to distribute work between SMs manually. But please don’t take the fact that I am mentioning this as any hint that you should be doing it.

Just queue so many blocks that a single unbalanced block in the end doesn’t matter.
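For instance, a grid-stride loop lets you decouple the grid size from the problem size, so you can simply oversubscribe the SMs. This is a minimal sketch with a hypothetical kernel named `process`:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel using a grid-stride loop: any number of blocks
// can process any number of elements, so the launch can oversubscribe
// the SMs and let the hardware balance the load.
__global__ void process(float *data, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
    {
        data[i] *= 2.0f;  // placeholder work
    }
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Launch far more blocks than the 15 SMs of a GTX570 can hold at
    // once; one stray unbalanced block is then lost in the noise.
    process<<<256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```

With hundreds of blocks queued, each SM picks up a new block as soon as it finishes one, so the imbalance you measured with 15 blocks disappears.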

If you do not have enough work in this kernel invocation to create enough blocks you can also launch independent kernels in different streams to fully load the GPU.
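A minimal sketch of the streams approach, with trivial placeholder kernels standing in for your independent work:

```cuda
#include <cuda_runtime.h>

// Two placeholder kernels representing independent pieces of work.
__global__ void kernelA(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

__global__ void kernelB(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main()
{
    const int n = 7 * 256;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Each launch alone would use only 7 of the 15 SMs; issued in
    // different streams, the kernels can run concurrently and fill
    // the GPU.
    kernelA<<<7, 256, 0, s0>>>(a, n);
    kernelB<<<7, 256, 0, s1>>>(b, n);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```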

CUDA really is a throughput architecture, not a latency-oriented one.