I’m trying to examine the relationship between performance and the number of blocks. What I do here is increase the block count from 1 per multiprocessor to 8 per multiprocessor. Performance suddenly drops when I use 90 blocks, i.e. 6 blocks per multiprocessor on average, given that the GTX480 has 15 multiprocessors. I don’t know how to explain this…
(It is the same kernel throughout, using both thread-level parallelism and instruction-level parallelism; it uses 27 registers per thread.)
I’m also trying different block sizes. The big drop at 90 blocks is obvious when the block size is small (i.e. 32, 64, 96, 128, 160, 192, or 224), and the curves are relatively flat once the block size grows beyond that range.
It is probably just an occupancy phenomenon. With fewer than a critical number of blocks, the execution time is pretty close to the execution time of one block. Once every MP reaches full occupancy, increasing the block count causes the execution time to jump to close to that of two blocks, because at least one block must wait until an MP has resources free to launch it.
To be specific, the GTX480 has 15 MPs, each of which has 32 cores. What I’m doing is calling the kernel with 15 blocks first, assuming each block is simply assigned to one of the MPs, so that the 15 blocks run concurrently on all 15 MPs. After that, I increase the block count to 30, 60, 90, and 120 so as to hide the memory access latency and, at the same time, find the threshold where the throughput saturates.
Basically, I expect that as the overall block count increases, the number of blocks assigned to each MP increases accordingly (I assume evenly). In this range, the throughput should go up linearly, because the pipelining becomes progressively more effective at hiding the latency. However, once all the latency cycles are hidden by a certain number of blocks on each MP, the throughput should saturate, meaning further increases in the block count gain no further acceleration, and the throughput curve, I guess, should be relatively flat. My confusion is that I never thought there would be a huge drop within that flat range. What I observe from the data is that the throughput goes up linearly at first, and just as it is about to flatten out, it dips into a concave notch when I launch 90 blocks (6 blocks/MP) or 120 blocks overall, then comes back up to the flat threshold at 150 blocks.