Increasing block size decreases performance

sgu · December 26, 2009, 8:41pm

Hello,
When I increased the block size from 128 to 256, I saw a decrease in performance (from 1.5 ms to 1.6 ms). However, according to the compiler output, no local variables spill to local memory (they are still stored in registers), and the occupancy is still 1.0. What could be responsible for the performance decrease? Thank you.

Gregory_Diamos · December 26, 2009, 9:19pm

Try to get your application to run longer. A difference between 1.5 ms and 1.6 ms is probably within the measurement noise of your system. A single OS context switch during your application would be enough to account for it. It may not have anything to do with the kernel actually taking longer to execute.

Try putting a loop around your kernel, launching it many times, and taking the average time.
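A minimal host-side sketch of that averaging loop, in plain C++ so it can be tried without a GPU. The names `averageMs` and `launch` are hypothetical and not from this thread. With a real kernel, the callable would have to launch the kernel and then wait for it to finish, because kernel launches return before the kernel completes.

```cpp
#include <cassert>
#include <chrono>

// Runs `launch` a few times untimed (warm-up), then `iterations` times
// under a wall-clock timer, and returns the mean milliseconds per call.
// Assumption: `launch` blocks until the work is done. For a CUDA kernel
// that means synchronizing (e.g. cudaDeviceSynchronize()) inside the
// callable; otherwise only the launch overhead is measured.
template <typename F>
double averageMs(F&& launch, int iterations, int warmup = 1) {
    assert(iterations > 0 && warmup >= 0);

    for (int i = 0; i < warmup; ++i) {
        launch();  // untimed: absorbs one-time startup costs
    }

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        launch();
    }
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

Calling it as `averageMs(runKernel, 1000)` for each block size (with `runKernel` a hypothetical wrapper around the launch) gives two averages that are far less sensitive to a single context switch than two one-off measurements.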

nitin.life · December 26, 2009, 9:49pm

By 1.0 occupancy you mean 100%, right? I guess you are not using shared memory, or only a very small amount of it, then?

Also, are your reads and writes to global memory coalesced? I ask because I have seen that a larger number of independent thread blocks helps hide global memory latency better.

Jimmy_Pettersson · December 26, 2009, 11:22pm

Yes, the rule is the more the merrier. As suggested, you managed to hide the global memory latency better when you could fit more blocks on each SM.
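To make the "more blocks on each SM" point concrete, here is a rough sketch of the residency arithmetic for both block sizes. It assumes the per-SM limits of compute capability 1.3 hardware (1024 resident threads, 8 resident blocks, 16384 registers) and a hypothetical 16 registers per thread. It ignores shared memory and register allocation granularity, so it is an approximation and not a replacement for the occupancy calculator. `SmLimits`, `Residency` and `residency` are illustrative names.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM limits, assumed here to be those of a compute capability 1.3 device.
struct SmLimits {
    int maxThreads = 1024;  // resident threads per SM
    int maxBlocks = 8;      // resident blocks per SM
    int registers = 16384;  // 32-bit registers per SM
    int warpSize = 32;
};

struct Residency {
    int blocks;        // resident blocks per SM
    int warps;         // resident warps per SM
    double occupancy;  // resident warps / maximum warps
};

// Estimates how many blocks fit on one SM, limited by threads,
// registers and the block cap. Shared memory is not modelled.
Residency residency(int threadsPerBlock, int regsPerThread,
                    const SmLimits& sm = SmLimits{}) {
    assert(threadsPerBlock > 0 && regsPerThread > 0);

    int byThreads = sm.maxThreads / threadsPerBlock;
    int byRegisters = sm.registers / (threadsPerBlock * regsPerThread);
    int blocks = std::min({byThreads, byRegisters, sm.maxBlocks});

    int warpsPerBlock = (threadsPerBlock + sm.warpSize - 1) / sm.warpSize;
    int warps = blocks * warpsPerBlock;
    int maxWarps = sm.maxThreads / sm.warpSize;

    return {blocks, warps, static_cast<double>(warps) / maxWarps};
}
```

Under these assumptions, `residency(128, 16)` gives 8 blocks and `residency(256, 16)` gives 4 blocks, both at occupancy 1.0. The occupancy figure is identical, but the smaller block size leaves twice as many independent blocks on each SM.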

sgu · December 27, 2009, 2:23am

I did launch the kernel many times, and the time I reported is the average.

Do you mean that more blocks are preferable to a bigger block size?

nitin.life · December 27, 2009, 3:42am

In a way… but you have to give preference to occupancy first, as that provides more bang.

Once you reach a good occupancy value (around 50% for a single-precision kernel on CC 1.3 devices), I guess the block size has more of a positive impact on performance; I say this from my experience.

Hence, if both of your implementations have the same occupancy and register usage, a larger number of blocks should help you hide the memory latency better. It will also help your application scale to future devices like Fermi, which have many more cores.

But you should also keep in mind the register read-after-write (RAW) latencies, which depend on your kernel's register usage and how you actually coded the algorithm. RAW latencies are hidden once there are 192 or more threads, but they have less impact on performance than memory latency, so they only affect compute-bound algorithms. It looks like yours is memory bound.
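For reference, the 192-thread figure can be reproduced with a short calculation. The sketch below assumes the numbers usually quoted for this hardware generation: roughly 24 cycles of register RAW latency and 4 cycles to issue one instruction for a 32-thread warp. Both values and both function names are assumptions for illustration.

```cpp
#include <cassert>

// Number of warps needed so that another warp always has an instruction
// to issue while one warp waits on a register result (rounded up).
constexpr int warpsToHideLatency(int latencyCycles, int cyclesPerWarpInstruction) {
    return (latencyCycles + cyclesPerWarpInstruction - 1) / cyclesPerWarpInstruction;
}

// The same quantity expressed in threads.
constexpr int threadsToHideLatency(int latencyCycles, int cyclesPerWarpInstruction,
                                   int warpSize = 32) {
    return warpsToHideLatency(latencyCycles, cyclesPerWarpInstruction) * warpSize;
}

// Assumed values: 24-cycle RAW latency, 4 cycles per warp instruction.
static_assert(warpsToHideLatency(24, 4) == 6, "6 warps cover 24 cycles");
static_assert(threadsToHideLatency(24, 4) == 192, "6 warps of 32 threads");
```

Block sizes of 128 and 256 at full occupancy both keep far more than 6 warps resident on each SM, so under these assumptions RAW latency is unlikely to explain the difference between them.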

Hope this helps

Tigga · December 28, 2009, 12:40am

Larger blocks will tend to be slower if you’re using __syncthreads. This may only be the case if you have branching that isn’t consistent across the block, though.
