I wonder how a Fermi GPU schedules thread blocks. Is it round-robin? That is, does it schedule the i-th block on the (i % number of SMs)-th SM? Let's assume linearly indexed blocks.
It is not round-robin. Where a block goes depends on SM load, i.e. on which SMs have free slots.
You can try to figure it out for your particular kernel by reading clock() at the start of every thread block and writing the value to memory.
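
A minimal sketch of that experiment, under some assumptions: the kernel name `record_schedule`, the helper `get_smid`, and the launch sizes are made up for illustration, and besides clock() it also reads the PTX special register `%smid` so you can see which SM each block landed on. Note that clock() is a per-SM counter, so timestamps from different SMs are not necessarily comparable; treat the output as a rough picture, not an exact timeline.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: read the ID of the SM the calling thread runs on
// from the PTX special register %smid.
__device__ unsigned int get_smid() {
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Thread 0 of every block records the block's start time and its SM.
__global__ void record_schedule(long long *start, unsigned int *sm) {
    if (threadIdx.x == 0) {
        start[blockIdx.x] = (long long)clock();  // per-SM cycle counter
        sm[blockIdx.x] = get_smid();
    }
    // ... the real work of your kernel would go here ...
}

int main() {
    const int num_blocks = 64;   // assumed sizes, pick your kernel's own
    const int num_threads = 128;

    long long *d_start;
    unsigned int *d_sm;
    cudaMalloc(&d_start, num_blocks * sizeof(long long));
    cudaMalloc(&d_sm, num_blocks * sizeof(unsigned int));

    record_schedule<<<num_blocks, num_threads>>>(d_start, d_sm);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    long long h_start[num_blocks];
    unsigned int h_sm[num_blocks];
    cudaMemcpy(h_start, d_start, sizeof(h_start), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_sm, d_sm, sizeof(h_sm), cudaMemcpyDeviceToHost);

    // One line per block: which SM it ran on and when it started.
    for (int b = 0; b < num_blocks; ++b)
        printf("block %3d  SM %2u  start %lld\n", b, h_sm[b], h_start[b]);

    cudaFree(d_start);
    cudaFree(d_sm);
    return 0;
}
```

If the SM column does not simply follow `b % (number of SMs)`, the assignment is not plain round-robin for that launch. The mapping can differ between runs and between kernels, since it depends on how many blocks fit on an SM at once.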