increasing blockSize -> Faster or slower

A newbie question.

For a general image processing application, say the image is WxH. No matter how you choose the blockSize or gridSize, most likely each pixel will be processed by its own thread. Does that mean the performance of the algorithm (without considering shared memory access) will be pretty much the same for different blockSize and gridSize values, as long as each pixel gets one thread?

thanks

If I may quote from a book on CUDA - “The ability of synchronizing with each other also imposes execution constraints on threads within a block. These threads should execute in close time proximity with each other to avoid excessively long waiting times”. I also did a few experiments, and based on those I can say that a smaller gridSize with a bigger blockSize (more threads per block) is the way to go!

  • D

more exactly:

it will be very slow with blocksize < 16, since the memory accesses can’t be coalesced,
it will get faster as the blocksize grows, up to the point where the maximum threads per multiprocessor (e.g. 768 on a 9800GT) is a multiple of the blocksize,
and it may get slower again when a too big blocksize leads to a lower occupancy of the multiprocessor.

on a 9800GT this means:

each MP can hold max. 8 blocks at once (it also has 8 “cuda cores”), and can hold max. 768 threads at once.

with blocksize 8 there are 8 * 8 = 64 threads executed at once, which leads to an occupancy of 8.3% -> BAD!
with blocksize 64 there are 8 * 64 = 512 threads executed at once, which leads to an occupancy of 66.7% -> actually ok!
with blocksize 96 there are 8 * 96 = 768 threads executed at once, which leads to an occupancy of 100% -> perfect
with blocksize 256 there are 3 * 256 = 768 threads executed at once, which leads to an occupancy of 100% -> still perfect
with blocksize 512 there are 1 * 512 = 512 threads executed at once, which leads to an occupancy of 66.7% -> actually ok, but slower than 256!

note: the number of cuda cores per MP and the max. number of threads per MP depend on the chip architecture.

max. threads per MP can be read from the cudaDeviceProp structure (field maxThreadsPerMultiProcessor); the cuda cores per MP count depends on the compute capability.

Optimum number of threads is also heavily affected by number of registers each thread will use, so it varies with the application.
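
To illustrate how registers come into it, here is a simplified sketch. It assumes 8192 registers per MP (the figure I believe applies to compute capability 1.0/1.1 parts such as the 9800GT) and ignores the allocation granularity of real hardware; the names are hypothetical.

```cpp
#include <algorithm>

// Per-multiprocessor limits; the register count is an assumption.
constexpr int kMaxBlocksPerMP  = 8;
constexpr int kMaxThreadsPerMP = 768;
constexpr int kRegistersPerMP  = 8192;  // assumed for compute capability 1.0/1.1

// Blocks that fit on one MP once the register budget is also considered.
int residentBlocksWithRegisters(int blockSize, int regsPerThread) {
    int byThreads   = kMaxThreadsPerMP / blockSize;
    int byRegisters = kRegistersPerMP / (regsPerThread * blockSize);
    return std::min({kMaxBlocksPerMP, byThreads, byRegisters});
}
```

Under these assumptions a kernel with blocksize 256 keeps 3 blocks resident at 10 registers per thread, but only 2 blocks at 16 registers per thread, so occupancy drops from 100% to 66.7% without touching the blocksize.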

Make the blocksize something you #define so you can easily change it and try different values (multiples of 32).
In extreme cases you can get maximum throughput with just one block per MP, i.e. occupancy is just a guide to what might be a good size.
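
A minimal host-side sketch of that idea, assuming one thread per pixel; BLOCK_SIZE and gridSizeFor are illustrative names, and the kernel launch itself is left out:

```cpp
// Change this one value (or pass -DBLOCK_SIZE=... to the compiler) to experiment.
#ifndef BLOCK_SIZE
#define BLOCK_SIZE 256
#endif

static_assert(BLOCK_SIZE % 32 == 0, "BLOCK_SIZE should be a multiple of 32");

// Number of blocks needed so that every pixel gets one thread.
// Rounds up, so the last block may be only partly used and the kernel
// has to check its index against pixelCount.
int gridSizeFor(int pixelCount) {
    return (pixelCount + BLOCK_SIZE - 1) / BLOCK_SIZE;
}
```

With the default of 256, a 640x480 image needs gridSizeFor(640 * 480) = 1200 blocks. The result and BLOCK_SIZE would then be used as the grid and block dimensions of the launch.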

Cheers

A quick update: I am reading the book “Programming Massively Parallel Processors”, which gives a good explanation of the thread scheduling issue. It discusses how different blockSize values affect performance.