Reason for the observed speed-up

Hi all,

I am working on a problem which involves finding maximum partial sums (MPS) of all the diagonals of large number of variable sized matrices. The MPS procedure for a particular matrix is handled by one single block. The threads in the block work on multiple diagonals serially in a way that load balancing is achieved. When processing a particular diagonal the thread reads matrix data from memory. Number of rows of all the matrices is same. Number of columns vary from something like 100 to 700. There are in total more than 20,000 such matrices thus more than 20,000 blocks to be called.

Question: In my earlier implementation the matrices were NOT arranged by their SIZES thus consecutive blocks worked on matrices with large size differnce. This implementation timed to around .67 seconds.
Then I arranged all the matrices by their sizes in the CPU (before sending it on to Device mem) After this minor change the implementation time has come down to .34 seconds.
The values of the matrix are being read from either texture memory or device memory (I have tried both the implementations) and the improvement in timing is shown by both implementations.

I am not able to guess why there is such a huge timing improvement. Because arranging the data, according to me, should only allow blocks (with IDs close to each other) run for similar durations.
Shouldnt the scheduler handle the situation where one block in a group takes much more time than others? Should it not automatically distribute the work? The knowledge I had about the block scheduling was that as soon as a block finishes execution on an SM another block is allocated to that SM. Is there something more to it or is there some completely other reason for such a speed-up?

I have checked the results. Also number of instructions, memory fetches divergent branches etc are almost the same.

Kindly let me know what you all think…