Block scheduling: How does it affect overall timing?


In my implementation, different blocks work on datasets of different sizes, so there is a large variation in per-block execution time.

Now, I observed that I get a 2x speed-up if I sort the datasets by their sizes on the CPU first, i.e., when blocks with similar runtimes run together, the total runtime is much better.
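To make the question concrete, here is a toy host-side model (not the real GPU scheduler, just an assumption for illustration): suppose each new block is dispatched greedily to whichever SM frees up first. Even in this simple model, the order of per-block workloads changes the total makespan:

```python
def makespan(durations, num_sms=2):
    # Greedy model: each new block goes to the SM that frees up first.
    free = [0.0] * num_sms
    for d in durations:
        i = free.index(min(free))
        free[i] += d
    return max(free)

mixed = [1, 8, 1, 8]                       # long and short blocks interleaved
grouped = sorted(mixed, reverse=True)      # similar runtimes grouped together

print(makespan(mixed))    # 10 -> one SM ends up with 1 + 1 + 8
print(makespan(grouped))  # 9  -> both SMs finish almost together
```

In the mixed order, one SM gets stuck with a long block at the end while the other sits idle (a "tail effect"); grouping similar runtimes keeps the SMs balanced. Whether this toy model matches what the hardware scheduler actually does is exactly what I am asking about.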

I am not able to explain the reason behind this speedup. Is it because of better block scheduling?

Please help