Optimal Execution Configuration: best choice for grid and block sizes

Hi,

I’m writing a program that needs to process a very large array of data, which I need to split into a grid and blocks. The size of the array is variable, and I would like to find a way to adapt the grid and block sizes as well as possible.

However, the information about the constraints and requirements for designing an optimal “execution configuration”, i.e. sizing the blocks and grid, is quite scattered in the CUDA manual (v2.1). So I’ve been trying to put that information together below.
These conditions should allow us to automatically resize the grid and blocks with good thread usage while avoiding runtime “invalid execution configuration” errors.


The variable “deviceProp” contains the device information from calling the function cudaGetDeviceProperties().
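For reference, something along these lines prints the fields used in the summary below (my own quick sketch, assuming device 0):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Query the properties of device 0 and print the fields used in the summary below.
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, 0);
    printf("maxThreadsPerBlock  = %d\n", deviceProp.maxThreadsPerBlock);
    printf("maxThreadsDim       = %d x %d x %d\n",
           deviceProp.maxThreadsDim[0], deviceProp.maxThreadsDim[1], deviceProp.maxThreadsDim[2]);
    printf("maxGridSize         = %d x %d x %d\n",
           deviceProp.maxGridSize[0], deviceProp.maxGridSize[1], deviceProp.maxGridSize[2]);
    printf("multiProcessorCount = %d\n", deviceProp.multiProcessorCount);
    return 0;
}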

Block constraints:

- Max nb of threads in total:
        dimBlock.x * dimBlock.y * dimBlock.z <= deviceProp.maxThreadsPerBlock
- Max nb of threads per dimension:
        dimBlock.x <= deviceProp.maxThreadsDim[0]
        dimBlock.y <= deviceProp.maxThreadsDim[1]
        dimBlock.z <= deviceProp.maxThreadsDim[2]
- Recommended:
        dimBlock.x = kx * half-warp (typ. 16)
        dimBlock.y = ky * half-warp (typ. 16)
        dimBlock.z = kz * half-warp (typ. 16)
        where kx, ky, kz are integers
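
Here is a small sketch of how those block constraints could be checked in code (the helper name is mine, just to illustrate):

#include <cuda_runtime.h>

// Returns true if dimBlock respects the block constraints listed above.
bool blockIsValid(const dim3 &dimBlock, const cudaDeviceProp &deviceProp)
{
    if (dimBlock.x * dimBlock.y * dimBlock.z > (unsigned)deviceProp.maxThreadsPerBlock)
        return false;
    if (dimBlock.x > (unsigned)deviceProp.maxThreadsDim[0]) return false;
    if (dimBlock.y > (unsigned)deviceProp.maxThreadsDim[1]) return false;
    if (dimBlock.z > (unsigned)deviceProp.maxThreadsDim[2]) return false;
    return true;
}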

Grid Constraints:

- Max nb of blocks per dimension:
        dimGrid.x <= deviceProp.maxGridSize[0]
        dimGrid.y <= deviceProp.maxGridSize[1]
        dimGrid.z <= deviceProp.maxGridSize[2]
- Recommended: 
        dimGrid.x * dimGrid.y * dimGrid.z >= 2 * deviceProp.multiProcessorCount
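
And the same idea for the grid constraints (again, the helper name and the warning are mine):

#include <cstdio>
#include <cuda_runtime.h>

// Returns true if dimGrid respects the grid constraints listed above.
bool gridIsValid(const dim3 &dimGrid, const cudaDeviceProp &deviceProp)
{
    if (dimGrid.x > (unsigned)deviceProp.maxGridSize[0]) return false;
    if (dimGrid.y > (unsigned)deviceProp.maxGridSize[1]) return false;
    if (dimGrid.z > (unsigned)deviceProp.maxGridSize[2]) return false;
    // The last point is only a recommendation, not a hard limit.
    if (dimGrid.x * dimGrid.y * dimGrid.z < 2u * deviceProp.multiProcessorCount)
        printf("warning: fewer than 2 blocks per multiprocessor\n");
    return true;
}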

I would appreciate it if anyone had any additions and/or corrections to this summary.

Besides, I have a question:

It seems that the grid is limited PER dimension but not by the total number of blocks. So I wanted to use a grid of the maximum size for my device, i.e. 65535x65535.
However, the following execution configuration crashed my PC, even though the kernel test_krnl() had an empty body and no parameters.

dim3 dimBlock(16,16);
dim3 dimGrid(65535,65535);
test_krnl<<<dimGrid,dimBlock>>>();

There wasn’t any error message (such as “invalid execution configuration”), but I was wondering whether it crashed because, the grid being too big, the GPU didn’t respond back to Windows XP quickly enough. Any idea?

Thank you in advance,

Julien

For your block size, use the CUDA Occupancy calculator to determine the optimum size, then use the size of your data and the block size to determine the grid size.
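
For a 1D array, the usual pattern looks something like this (kernel and variable names are just placeholders):

#include <cuda_runtime.h>

// Placeholder kernel: each thread handles one element and guards against overshoot,
// since the last block may be only partially filled.
__global__ void my_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

void launch(float *d_data, int n)
{
    int blockSize = 256;                              // e.g. taken from the occupancy calculator
    int gridSize  = (n + blockSize - 1) / blockSize;  // ceiling division so every element is covered
    my_kernel<<<gridSize, blockSize>>>(d_data, n);
}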

About the crash, I believe that your kernel is taking too long to execute, and is triggering the watchdog timer.

I would also like to add a recommendation to your list:
dimBlock.x * dimBlock.y * dimBlock.z == k * 64, where k is an integer; but this shouldn’t really be a concern, considering you are dealing with huge data sets.
You also want your occupancy to be as high as possible, and preferably:
dimGrid.x * dimGrid.y * dimGrid.z = k * ThreadBlocksPerMultiprocessor (from the occupancy calculator) * deviceProp.multiProcessorCount
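For example (numbers purely illustrative): if the occupancy calculator reports 4 resident thread blocks per multiprocessor and the device has 30 multiprocessors, a grid whose total number of blocks is a multiple of 4 * 30 = 120 keeps all multiprocessors evenly loaded.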

  • Occupancy isn’t everything
  • There are always exceptions to the “rules of thumb” on block sizes
  • There are too many competing factors to predict performance as a function of block size
  • In short, the only way to know you are getting optimal performance is to benchmark the kernel at all block sizes and choose the fastest (a timing sketch follows this list). Then choose a grid to match your dataset based on that block size.
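
A rough way to do that benchmarking with CUDA events (kernel and data names are the same placeholders as in the earlier sketch):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void my_kernel(float *data, int n);        // placeholder kernel from the sketch above

// Time one launch of the kernel at a given block size and return the elapsed milliseconds.
float timeKernel(int blockSize, float *d_data, int n)
{
    int gridSize = (n + blockSize - 1) / blockSize;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    my_kernel<<<gridSize, blockSize>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Usage: sweep multiples of the warp size and keep the fastest.
// for (int bs = 32; bs <= 512; bs += 32)
//     printf("block size %d: %.3f ms\n", bs, timeKernel(bs, d_data, n));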

More discussion and benchmarks to back up these claims here: http://forums.nvidia.com/index.php?s=&showtopic=88437&view=findpost&p=500352

Of course, but it’s a good starting point. I’ve found that in most cases, having enough blocks to evenly fill all MPs yields higher throughput than having some blocks leftover trying to finish processing the data.

That, and making sure the kernel is as fast as possible. I’ve found huge performance differences from just the way a loop is set up in the kernel (while vs. for, unrolling, etc.).
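
As an illustration of the unrolling point (kernel name is made up), nvcc will unroll a loop when hinted, and that alone can make a measurable difference:

#include <cuda_runtime.h>

// One thread per row; the inner loop is unrolled 4x as a hint to the compiler.
__global__ void scale_rows(float *data, int rowLen, int numRows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= numRows)
        return;
    #pragma unroll 4
    for (int j = 0; j < rowLen; ++j)
        data[row * rowLen + j] *= 2.0f;
}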