Your graph jumps at 96, 192, 288. I don't understand why you would call them multiples of "6". Maybe you were referring to it at the level of a single MP.
I understand your explanation, but let me restate it as I understand it. I think our explanations are going to be similar.
When you schedule 16 blocks, I would assume the GPU schedules them directly on 16 multiprocessors. So the turnaround time you see here is the minimum turnaround time required to execute just 1 block. This remains constant until 96 blocks, which is explainable: you can run 6 blocks concurrently within an MP, and it looks like the GPU is effectively overlapping global memory access and computation. It seems to me that your global memory access is what determines your block turnaround time; the computation is very minimal. Hence the time taken remains constant for 1 block as well as for 6 blocks running on the same MP (with 96 blocks running concurrently across the GPU). That's the reason I can think of.
Now, what happens from 96 to 192? When the initial 96 blocks are finished, another 96 blocks are scheduled (or the MPs are re-used). Obviously, the time taken is going to increase, and as you can see from the graph, it doubles. The reason is obvious: the GPU simply runs a second full wave of 96 blocks.
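To make the wave arithmetic concrete, here is a minimal Python sketch of the model above. The figures (16 MPs, 6 resident blocks per MP, hence 96-block waves) come from this thread; `T_BLOCK` is just a hypothetical per-wave time in arbitrary units:

```python
import math

NUM_MPS = 16            # multiprocessors on the GPU (from this thread)
BLOCKS_PER_MP = 6       # blocks resident concurrently on one MP
WAVE = NUM_MPS * BLOCKS_PER_MP   # 96 blocks execute as one "wave"
T_BLOCK = 1.0           # hypothetical turnaround time of one wave

def turnaround(num_blocks):
    """Predicted kernel time: one T_BLOCK per (full or partial) wave of 96."""
    return math.ceil(num_blocks / WAVE) * T_BLOCK

# Flat until 96, doubles after 96, triples after 192 -- matching the jumps
for n in (16, 96, 97, 192, 288):
    print(n, turnaround(n))
```

If this model holds, the graph should stay flat between the jump points and step up by the same amount at every multiple of 96, which is what you describe.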
I understand your question on having multiple blocks. Since your occupancy is 0.25, you are actually using 6 warps, which equals 192 threads. So that's good in a way. Anyway, your application looks to be constrained only by global memory access and NOT by computation, so this latency may NOT matter at all.
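The 0.25 → 6 warps arithmetic is easy to check. A quick sketch, assuming a compute-1.x part where an MP supports at most 24 resident warps (768 threads):

```python
WARP_SIZE = 32
MAX_WARPS_PER_MP = 24   # 768 threads per MP, assuming a compute-1.x GPU

occupancy = 0.25
active_warps = int(occupancy * MAX_WARPS_PER_MP)   # 6 warps per MP
active_threads = active_warps * WARP_SIZE          # 192 threads per MP
print(active_warps, active_threads)   # 6 192
```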
I am NOT sure about the latency of switching blocks compared to switching warps. If your occupancy is NOT limited by registers, you can try increasing warps per block and see if it matters.
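Before running that experiment, you can estimate whether registers are the limiter. A simplified occupancy sketch (it ignores register-allocation granularity and shared memory), assuming compute-1.x limits of 24 warps, 8 blocks, and 8192 registers per MP; the 10 registers/thread figure is purely hypothetical:

```python
import math

WARP_SIZE = 32
MAX_WARPS_PER_MP = 24     # 768 threads per MP, compute-1.x assumption
MAX_BLOCKS_PER_MP = 8     # resident-block limit, compute-1.x
REGISTERS_PER_MP = 8192   # compute-1.0/1.1 assumption

def occupancy(threads_per_block, regs_per_thread):
    """Active warps / max warps for one candidate launch configuration."""
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    # Resident blocks allowed by each limiting resource:
    by_warps = MAX_WARPS_PER_MP // warps_per_block
    by_blocks = MAX_BLOCKS_PER_MP
    by_regs = REGISTERS_PER_MP // (threads_per_block * regs_per_thread)
    blocks = min(by_warps, by_blocks, by_regs)
    return blocks * warps_per_block / MAX_WARPS_PER_MP

# Hypothetical 10 registers/thread: small blocks hit the 8-block limit,
# so adding warps per block raises occupancy until another limit binds.
for tpb in (32, 64, 128, 256):
    print(tpb, occupancy(tpb, 10))
```

If the block-count limit (8 per MP) is what is capping you rather than registers, increasing warps per block should raise occupancy, which is exactly the experiment suggested above.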