Increasing blockSize -> faster or slower?

pidanchen · August 27, 2011, 1:02am

A newbie question.

For a general image processing application, say the image is WxH. No matter how you choose the blockSize or gridSize, each pixel will most likely be processed by one thread. Does that mean the performance of the algorithm (ignoring shared memory access) will be pretty much the same for different blockSize and gridSize values, as long as each pixel gets one thread?
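A minimal sketch of the one-thread-per-pixel mapping described above, written as plain host-side C++ so it runs without a GPU. The names ceilDiv and planLaunch are made up for illustration; the point is only that different block shapes cover the same WxH image with different grid sizes and a different number of idle threads at the edges.

```cpp
// Hypothetical helper: ceiling division, so a partial tile still gets a block.
inline int ceilDiv(int a, int b) { return (a + b - 1) / b; }

struct Launch {
    int gridX, gridY;    // blocks per grid in x and y
    long long threads;   // total threads launched
    long long idle;      // threads that fall outside the image
};

// One thread per pixel of a W x H image, using blockX x blockY thread blocks.
inline Launch planLaunch(int W, int H, int blockX, int blockY) {
    Launch l;
    l.gridX = ceilDiv(W, blockX);
    l.gridY = ceilDiv(H, blockY);
    l.threads = 1LL * l.gridX * blockX * l.gridY * blockY;
    l.idle = l.threads - 1LL * W * H;
    return l;
}
```

For example, planLaunch(1000, 600, 16, 16) gives a 63x38 grid with 12864 idle threads, which a kernel would skip with a bounds check on x and y.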

thanks

droy · August 27, 2011, 1:44am

If I may quote from a book on CUDA: “The ability of synchronizing with each other also imposes execution constraints on threads within a block. These threads should execute in close time proximity with each other to avoid excessively long waiting times”. I also ran a few experiments, and based on those I would say a smaller gridSize with a bigger blockSize (more threads per block) is the way to go!

D

Snowball_Two · August 27, 2011, 9:42am

More exactly:

It will be very slow with blockSize < 16, since the memory accesses can't be coalesced,
it will get faster as blockSize grows, up to the point where the maximum number of threads per multiprocessor (e.g. 768 on a 9800GT) is a multiple of the blockSize,
and it may get slower again when a blockSize that is too big leads to a lower occupancy of the multiprocessor.

On a 9800GT this means:

Each MP can hold at most 8 resident blocks and at most 768 resident threads (it also has 8 “CUDA cores”, but the block limit is a separate hardware limit).

with blockSize 8 there are 8*8 = 64 threads resident at once, which leads to an occupancy of 8.3% → BAD!
with blockSize 64 there are 8*64 = 512 threads resident at once, which leads to an occupancy of 66.7% → actually ok!
with blockSize 96 there are 8*96 = 768 threads resident at once, which leads to an occupancy of 100% → perfect
with blockSize 256 there are 3*256 = 768 threads resident at once, which leads to an occupancy of 100% → still perfect
with blockSize 512 there is 1*512 = 512 threads resident at once, which leads to an occupancy of 66.7% → actually ok, but slower than 256!
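The arithmetic in this list can be sketched in a few lines of host-side C++. This is only an illustration under the limits quoted in the post (768 threads and 8 blocks per MP); MpLimits, residentBlocks and occupancy are hypothetical names, and the sketch counts threads the way the post does, ignoring registers, shared memory, and the fact that the hardware schedules in warps of 32.

```cpp
// Per-multiprocessor limits, passed in rather than hard-coded.
struct MpLimits {
    int maxThreadsPerMp;  // e.g. 768 on a 9800GT
    int maxBlocksPerMp;   // e.g. 8
};

// Resident blocks per MP: the smaller of the block cap and the thread cap.
constexpr int residentBlocks(int blockSize, MpLimits lim) {
    return (lim.maxThreadsPerMp / blockSize < lim.maxBlocksPerMp)
               ? lim.maxThreadsPerMp / blockSize
               : lim.maxBlocksPerMp;
}

// Fraction of the MP's thread slots that are filled.
constexpr double occupancy(int blockSize, MpLimits lim) {
    return static_cast<double>(residentBlocks(blockSize, lim) * blockSize) /
           lim.maxThreadsPerMp;
}
```

With MpLimits{768, 8}, block sizes 8, 64, 96, 256 and 512 give 8, 8, 8, 3 and 1 resident blocks, matching the list.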

Note: the number of CUDA cores per MP and the maximum number of threads per MP depend on the chip architecture.

The maximum number of threads per MP can be read from the cudaDeviceProp structure (field maxThreadsPerMultiProcessor); the number of CUDA cores per MP depends on the compute capability.

kbam · August 29, 2011, 1:55am

The optimum number of threads is also heavily affected by the number of registers each thread uses, so it varies with the application.

Make the block size something you #define so you can easily change it and try different values (multiples of 32).
In extreme cases you can get maximum throughput with just one block per MP, i.e. occupancy is just a guide to what might be a good size.
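A rough sketch of how register use can cap the number of resident blocks, in host-side C++. The names are hypothetical, the 8192-register file is an assumption (the documented figure for compute capability 1.0/1.1 parts), and the sketch ignores register allocation granularity and shared memory, so treat the results as estimates only.

```cpp
// Assumed per-multiprocessor limits for a compute capability 1.1 part.
struct MpLimits {
    int maxThreads;  // 768
    int maxBlocks;   // 8
    int registers;   // 8192 (assumption, see text)
};

// Resident blocks per MP, limited by the thread cap, the block cap, and registers.
inline int residentBlocks(int blockSize, int regsPerThread, const MpLimits& lim) {
    int byThreads = lim.maxThreads / blockSize;
    int byRegs = lim.registers / (regsPerThread * blockSize);
    int blocks = lim.maxBlocks;
    if (byThreads < blocks) blocks = byThreads;
    if (byRegs < blocks) blocks = byRegs;
    return blocks;  // 0 means even a single block does not fit
}

// Fraction of the MP's thread slots that are filled.
inline double occupancy(int blockSize, int regsPerThread, const MpLimits& lim) {
    return static_cast<double>(residentBlocks(blockSize, regsPerThread, lim) * blockSize) /
           lim.maxThreads;
}
```

Under these assumptions a 256-thread block fits 3 times per MP at 10 registers per thread, 2 times at 16, once at 32, and not at all at 40, which is why the best block size has to be found by trying several values for each kernel.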

Cheers

pidanchen · September 12, 2011, 10:41pm

A quick update: I am reading the book “Programming Massively Parallel Processors”. It gives a good explanation of the thread scheduling issue and discusses the effect of different blockSize values.
