While running experiments on a GTX570 with CUDA, I noticed the block scheduler making a strange decision across the SMs, and the results leave me with a question. The experiments are described below.
GTX570 has 15 SMs.
First EXP.:
Launch 14 thread blocks, each with 256 threads.
When I profile the kernel with the Visual Profiler, the kernel latency graph shows the result below.
Multiprocessor SM0 SM1 SM2 SM3 SM4 SM5 SM6 SM7 SM8 SM9 SM10 SM11 SM12 SM13 SM14
Utilization___ 50% 50% 50% 50% 50% 50% 0%_ 50% 50% 50%_ 50%_ 50%_ 50%_ 50%_ 50%
(SM6 has no block; every other SM has one block.)
Second EXP.:
Launch 15 thread blocks, each with 256 threads.
The graph looks like:
Multiprocessor SM0_ SM1 SM2 SM3 SM4 SM5 SM6 SM7 SM8 SM9 SM10 SM11 SM12 SM13 SM14
Utilization___ 100% 50% 50% 50% 50% 50% 0%_ 50% 50% 50% 50%_ 50%_ 50%_ 50%_ 50%
(SM0 has been assigned two blocks, and SM6 still has no block.)
The two experiments give very different execution durations: the second takes at least twice as long as the first.
My question is: why does the second experiment take twice the time? Does it mean that, in the second experiment, SM0 is assigned two blocks (16 warps), so its time consumption is twice that of the first experiment, and the whole kernel has to wait for it?