In my implementation, different blocks work on datasets of different sizes, so there is a large variation in per-block execution time.
Now, I observe roughly a 2x speedup if I sort the datasets by size on the CPU before launching the kernel, i.e. when blocks with similar runtimes are launched together, the total runtime is much better.
I cannot explain this speedup. Is it because of better block scheduling?
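My current guess is a tail effect: blocks are dispatched to SMs roughly in launch order, so in the unsorted case a long block can land late and keep one SM busy while the others sit idle. Here is a toy Python simulation of that hypothesis (greedy dispatch of one block at a time to the earliest-free SM; the SM count and durations are made up, and real GPUs run several blocks per SM concurrently, so this is only a sketch of the mechanism, not a model of actual hardware):

```python
import heapq

def makespan(block_durations, num_sms):
    """Greedy dispatch: each block, in launch order, goes to the
    SM that becomes free earliest. Returns total runtime."""
    free_at = [0.0] * num_sms          # min-heap of SM free times
    heapq.heapify(free_at)
    for d in block_durations:
        t = heapq.heappop(free_at)     # earliest-free SM
        heapq.heappush(free_at, t + d) # it runs this block next
    return max(free_at)

# Hypothetical workload: 3 long blocks (10 units) interleaved
# with short blocks (1 unit), on 4 SMs.
mixed  = [10, 1, 1, 1, 10, 1, 1, 1, 10, 1, 1, 1]
sorted_desc = sorted(mixed, reverse=True)

print(makespan(mixed, 4))        # → 12  (long blocks straggle into later waves)
print(makespan(sorted_desc, 4))  # → 10  (long blocks start immediately, shorts fill gaps)
```

In this toy model the unsorted order leaves SMs idle while a late-launched long block finishes alone, which matches the kind of speedup I am seeing when I group similar-sized datasets.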