Reason for the observed speed-up

Hi all,

I am working on a problem that involves finding the maximum partial sums (MPS) of all the diagonals of a large number of variable-sized matrices. The MPS procedure for a particular matrix is handled by a single block. The threads in the block work on multiple diagonals serially in a way that achieves load balancing. While processing a diagonal, a thread reads the matrix data from memory. The number of rows is the same for all matrices; the number of columns varies from roughly 100 to 700. There are more than 20,000 such matrices in total, so more than 20,000 blocks are launched.
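
In case it helps, here is a simplified sketch of how the kernel is organized (the identifier names are placeholders, and I am showing the MPS as a maximum prefix sum along each anti-diagonal; the real code differs in the details):

```cuda
#include <cfloat>

// Simplified sketch, not the actual code: one block per matrix, threads
// stride over that matrix's anti-diagonals.  All names are placeholders.
__global__ void mpsKernel(const float *matrices,    // all matrices packed contiguously
                          const int   *matOffsets,  // element offset of each matrix
                          const int   *colCounts,   // number of columns per matrix
                          const int   *resOffsets,  // output offset of each matrix
                          float       *results,     // one MPS value per diagonal
                          int          numRows)     // same for every matrix
{
    int m = blockIdx.x;                      // matrix index == block index
    int cols = colCounts[m];
    const float *mat = matrices + matOffsets[m];
    float *out = results + resOffsets[m];
    int numDiags = numRows + cols - 1;

    // Threads pick up diagonals in a strided loop, so short and long
    // diagonals are interleaved across the block (rough load balancing).
    for (int d = threadIdx.x; d < numDiags; d += blockDim.x)
    {
        // Anti-diagonal d holds the elements (r, c) with r + c == d.
        int r = max(0, d - cols + 1);
        int c = d - r;
        float running = 0.0f;
        float best    = -FLT_MAX;
        for (; r < numRows && c >= 0; ++r, --c)
        {
            running += mat[r * cols + c];    // row-major storage assumed
            best = fmaxf(best, running);     // maximum partial (prefix) sum
        }
        out[d] = best;
    }
}
```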

Question: In my earlier implementation the matrices were NOT arranged by their SIZES, so consecutive blocks worked on matrices with a large size difference. This implementation ran in about 0.67 seconds.
Then I sorted all the matrices by size on the CPU (before copying them to device memory). After this minor change, the time came down to 0.34 seconds.
The matrix values are read from either texture memory or device memory (I have tried both implementations), and the timing improvement shows up in both.
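
For reference, the host-side reordering I added is roughly the following (simplified, with placeholder struct and variable names):

```cuda
#include <algorithm>
#include <vector>

// Placeholder descriptor for one matrix on the host.
struct MatDesc {
    int          cols;     // number of columns (rows are identical everywhere)
    const float *hostData; // pointer to this matrix's data on the host
};

// Sort the matrix descriptors by column count before packing them into one
// contiguous buffer and copying that buffer to device memory, so that
// consecutive blockIdx.x values see matrices of similar size.
void sortBySize(std::vector<MatDesc> &mats)
{
    std::sort(mats.begin(), mats.end(),
              [](const MatDesc &a, const MatDesc &b) { return a.cols < b.cols; });
}
```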

I cannot figure out why there is such a large timing improvement. As far as I can tell, sorting the data should only make blocks with nearby IDs run for similar durations.
Shouldn't the scheduler handle the situation where one block in a group takes much longer than the others? Shouldn't it automatically redistribute the work? My understanding of block scheduling was that as soon as a block finishes executing on an SM, another block is assigned to that SM. Is there something more to it, or is there some completely different reason for such a speed-up?

I have checked the results and they are correct. Also, the number of instructions, memory fetches, divergent branches, etc. are almost the same in both versions.

Kindly let me know what you all think…

Thanks
Sid.