General Formula for Thread/Block Ratio

Hi all,

I’m fairly new to CUDA, and while a lot of it seems pretty straightforward (thank goodness for the people who made GPU parallelization this easy!), I feel I’m in the dark about choosing a healthy ratio of blocks to threads (or whether it even matters).

Here’s the general layout of my application-to-be:

Basically, it’s a 1-, 2-, or 3-D fluid dynamics grid solver, so at some point I want a thread to run for each cube in the grid. The order in which the grid cubes are calculated is not important. I’m starting with the 1-D case, which could have anywhere from 2 to more than 70,000 grid cubes, so I don’t want to hard-code the number of blocks in the kernel call. Would a reasonable technique be to use a series of if statements to choose one thread per block, 128 threads per block, or 512, depending on the total number of threads required?

For all I know this issue is more or less unimportant, but minimizing runtime is the objective, and since I don’t have much formal CS training, the theory of optimizing for the hardware is somewhat opaque to me.



One hardware-level guideline is to be sure your block size is a multiple of the warp size, which is 32 for all devices so far. The hardware executes instructions on entire warps, not threads, so if there are not enough threads to fill a warp, the CUDA cores will be idle part of the time and your effective throughput will be low.

You should, if possible, design your code so that you can benchmark it with block sizes ranging from 32 up to 512 threads per block, in multiples of 32. Many people find that the optimal number of threads per block is not what they would predict, as it can depend on subtle timing issues.

Generally speaking, 128 to 256 threads per block is a good starting place.